Tensor sparsification via a bound on the spectral norm of random tensors

Nam Nguyen, Petros Drineas, Trac Tran

Abstract 



Given an order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{n \times n \times \cdots \times n}$, we present a simple, element-wise
sparsification algorithm that zeroes out all sufficiently small elements of $\mathcal{A}$, keeps all
sufficiently large elements of $\mathcal{A}$, and retains some of the remaining elements with probabilities
proportional to the square of their magnitudes. We analyze the approximation accuracy of the
proposed algorithm using a powerful inequality that we derive. This inequality bounds
the spectral norm of a random tensor and is of independent interest. As a result, we
obtain novel bounds for the tensor sparsification problem. As an added bonus, we obtain
improved bounds for the matrix ($d = 2$) sparsification problem.



1 Introduction 



Technological developments over the last two decades (in both scientific and internet domains)
permit the automatic generation of very large data sets. Such data are often modeled as
matrices, since an $m \times n$ real-valued matrix $A$ provides a natural structure to encode information
about $m$ objects, each of which is described by $n$ features. A generalization of this framework
permits the modeling of the data by higher-order arrays or tensors (e.g., arrays with more than
two modes). A natural example is time-evolving data, where the third mode of the tensor
represents time [Hj. Numerous other examples exist, including tensor applications in higher- 
order statistics, where tensor-based methods have been leveraged in context of, for example, 
Independent Components Analysis (ICA), in order to exploit the statistical independence of 
the sources [QUEUES]- 

A large body of recent work has focused on the design and analysis of algorithms that 
efficiently create small "sketches" of matrices and tensors. Such sketches are subsequently used 
in eigenvalue and eigenvector computations [151 Q], in data mining applications [251 [23 13 [21],
or even to solve combinatorial optimization problems [3[ I10|ITT]. Existing approaches include,
for example, the selection of a small number of rows and columns of a matrix in order to form
the so-called CUR matrix/tensor decomposition [12[ l25l 126], as well as random-projection-based
methods that employ fast randomized variants of the Hadamard-Walsh transform [29]
or the Discrete Cosine Transform [27].

An alternative approach was pioneered by Achlioptas and McSherry in 2001 [H [2] and 
leveraged the selection of a small number of elements in order to form a sketch of the in- 
put matrix. (A rather straightforward extension of their work to tensors was described by
Tsourakakis in [30].) Motivated by their work, we define the following matrix/tensor
sparsification problem.



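Definition 1. [Matrix/tensor Sparsification] Given an order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{n \times n \times \cdots \times n}$
and an error parameter $\epsilon > 0$, construct a sketch $\tilde{\mathcal{A}} \in \mathbb{R}^{n \times n \times \cdots \times n}$ of $\mathcal{A}$ such that

$$\left\| \mathcal{A} - \tilde{\mathcal{A}} \right\|_2 \le \epsilon$$

and the number of non-zero entries in $\tilde{\mathcal{A}}$ is minimized.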

A few comments are necessary to better understand the above definition. First, an order-$d$
tensor is simply a $d$-way array (obviously, a matrix is an order-2 tensor). We let $\|\cdot\|_2$ denote
the spectral norm of a tensor (see Section 2.1 for notation), which is a natural extension of
the matrix spectral norm. It is worth noting that exactly computing the tensor spectral norm
is computationally hard. Second, a similar problem could be formulated by seeking a bound
for the Frobenius norm of $\mathcal{A} - \tilde{\mathcal{A}}$. Third, this definition places no constraints on the form of
the entries of $\tilde{\mathcal{A}}$. However, in this work, we will focus on methods that return matrices and
tensors $\tilde{\mathcal{A}}$ whose entries are either zeros or (rescaled) entries of $\mathcal{A}$. Prior work has investigated
quantization as an alternative construction for the entries of $\tilde{\mathcal{A}}$, while the theoretical properties
of more general methods remain largely unexplored. Fourth, the running time needed to
construct a sketch is not restricted. All prior work has focused on the construction of sketches
in one or two sequential passes over the input matrix or tensor. Thus, we are particularly
interested in sketching algorithms that can be implemented within the same framework (a
small number of sequential passes).

We conclude this section by discussing applications of the sparse sketches of Definition 1.
In the case of matrices, there are at least three important applications: approximate
eigenvector computations, semi-definite programming (SDP) solvers, and matrix completion. The
first two applications are based on the fact that, given a vector $x \in \mathbb{R}^n$, the product $Ax$
can be approximated by $\tilde{A}x$ with a bounded loss in accuracy. The running time of the latter
matrix-vector product is proportional to the number of non-zeros in $\tilde{A}$, thus leading to
immediate computational savings. This fast matrix-vector product operation can then be
used to approximate eigenvectors and eigenvalues of matrices [H EJ [5] via subspace iteration
methods; yet another application would be a quick estimate of the Krylov subspace of a
matrix. Additionally, [4, 9] argue that fast matrix-vector products are useful in SDP solvers.
The third application domain of sparse sketches is the so-called matrix completion problem,
an active research area of growing interest, where the user only has access to $\tilde{A}$ (typically
formed by sampling a small number of elements of $A$ uniformly at random) and the goal
is to reconstruct the entries of $A$ as accurately as possible. The motivation underlying the
matrix completion problem stems from recommender systems and collaborative filtering and
was initially discussed in [6]. More recently, methods using bounds on $A - \tilde{A}$ and trace
minimization algorithms have demonstrated exact reconstruction of $A$ under (rather restrictive)
assumptions 0[8]. Finally, similar applications in recommendation systems, collaborative
filtering, monitoring IP traffic patterns over time, etc. exist for the $d > 2$ case in Definition 1;
see [301 123 [26] for details.



1.1 Our algorithm and our main theorem 

Our main algorithm (Algorithm 1) zeroes out "small" elements of the tensor A, keeps "large" 
elements of the tensor A, and randomly samples the remaining elements of the tensor A with 
a probability that depends on their magnitude. The following theorem is our main quality-of- 
approximation result for Algorithm 1. 



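1: Input: order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{n \times n \times \cdots \times n}$, sampling parameter $s$.

   For all $(i_1, \ldots, i_d) \in [n] \times \cdots \times [n]$ do

   • If $A_{i_1 \ldots i_d}^2 \le \dfrac{\ln^d n}{n^{d/2}} \cdot \dfrac{\|\mathcal{A}\|_F^2}{s}$ then

       $\tilde{A}_{i_1 \ldots i_d} = 0$,

   • ElseIf $A_{i_1 \ldots i_d}^2 \ge \dfrac{\|\mathcal{A}\|_F^2}{s}$ then

       $\tilde{A}_{i_1 \ldots i_d} = A_{i_1 \ldots i_d}$,

   • Else

       $\tilde{A}_{i_1 \ldots i_d} = \begin{cases} A_{i_1 \ldots i_d} / p_{i_1 \ldots i_d} & \text{with probability } p_{i_1 \ldots i_d} = \dfrac{s A_{i_1 \ldots i_d}^2}{\|\mathcal{A}\|_F^2}, \\ 0 & \text{with probability } 1 - p_{i_1 \ldots i_d}. \end{cases}$

2: Output: Tensor $\tilde{\mathcal{A}} \in \mathbb{R}^{n \times n \times \cdots \times n}$.

Algorithm 1: Tensor Sparsification Algorithm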
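
The three branches of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the tensor is passed as a dictionary from index tuples to values (a hypothetical input format), the function name `sparsify` and its parameters are ours, and the thresholds are written as we read them from Algorithm 1 (small entries compared against $(\ln^d n / n^{d/2}) \|\mathcal{A}\|_F^2 / s$, large entries against $\|\mathcal{A}\|_F^2 / s$).

```python
import math
import random

def sparsify(entries, n, d, s, rng=None):
    """Sparse sketch of an order-d tensor with n indices per mode.

    entries: dict {index tuple: value} holding the tensor (assumed format).
    s: sampling parameter of Algorithm 1.
    """
    rng = rng or random.Random()
    fro2 = sum(v * v for v in entries.values())        # squared Frobenius norm
    if fro2 == 0.0:
        return {}
    low = (math.log(n) ** d / n ** (d / 2)) * fro2 / s  # "small" threshold
    high = fro2 / s                                     # "large" threshold
    sketch = {}
    for idx, v in entries.items():
        sq = v * v
        if sq <= low:
            continue                                    # zero out small entries
        if sq >= high:
            sketch[idx] = v                             # keep large entries as-is
        else:
            p = s * sq / fro2                           # p < 1 because sq < high
            if rng.random() < p:
                sketch[idx] = v / p                     # rescale so E[sketch] = v
    return sketch
```

Note that this sketch computes $\|\mathcal{A}\|_F^2$ in a preliminary pass over the entries; it does not reproduce the Sample algorithm of [2] that the text relies on for a one-pass implementation.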

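Theorem 1. Let $\mathcal{A} \in \mathbb{R}^{n \times \cdots \times n}$ be an order-$d$ tensor and let $\tilde{\mathcal{A}}$ be constructed as described in
Algorithm 1. Assume that $n \ge 300$ and $2 \le d \le 0.5 \ln n$. If the sampling parameter $s$ satisfies

$$s = \Omega\left( \frac{d^3\, 8^{2d}\, \mathrm{st}(\mathcal{A})\, n^{d/2} \ln^3 n}{\epsilon^2} \right),$$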



then, with probability at least 1 — n , 



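$$\left\| \mathcal{A} - \tilde{\mathcal{A}} \right\|_2 \le \epsilon.$$

In the above $\mathrm{st}(\mathcal{A})$ is the stable rank of the tensor $\mathcal{A}$, i.e.,

$$\mathrm{st}(\mathcal{A}) = \frac{\|\mathcal{A}\|_F^2}{\|\mathcal{A}\|_2^2}.$$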



The following corollary focuses on the matrix case ($d = 2$) and is an immediate consequence
of Theorem 1.

Corollary 1. Let $A \in \mathbb{R}^{n \times n}$ (assume $n \ge 300$) be a matrix and let $\tilde{A}$ be constructed as
described in Algorithm 1. If the sampling parameter $s$ satisfies

$$s = \Omega\left( \frac{\mathrm{st}(A)\, n \ln^3 n}{\epsilon^2} \right),$$



then, with probability at least 1 — n 



$$\left\| A - \tilde{A} \right\|_2 \le \epsilon.$$

In the above $\mathrm{st}(A)$ is the stable rank of the matrix $A$, i.e.,

$$\mathrm{st}(A) = \frac{\|A\|_F^2}{\|A\|_2^2}.$$



In both Theorem 1 and Corollary 1, $\tilde{\mathcal{A}}$ has, in expectation, at most $2s$ non-zero entries and the
construction of $\tilde{\mathcal{A}}$ can be implemented in one pass over the input tensor/matrix $\mathcal{A}$. Towards
that end, we need to combine Algorithm 1 with the Sample algorithm presented in Section
4.1 of [2]. Finally, in the context of Definition 1, our result essentially shows that we can get
a sparse sketch $\tilde{\mathcal{A}}$ with $2s$ non-zero entries.

1.2 Comparison with prior work 

To the best of our knowledge, for $d > 2$, there exists no prior work on element-wise tensor
sparsification that provides results comparable to Theorem 1. (It is worth noting that the work
of [30] deals with the Frobenius norm of the tensor, which is much easier to manipulate, and
its main theorem is focused on approximating the so-called HOSVD of a tensor, as opposed
to decomposing the tensor as a sum of rank-one components.)

For the $d = 2$ case, prior work does exist and we will briefly compare our results in
Corollary 1 with the current state of the art. In summary, our result in Corollary 1 outperforms
prior work, in the sense that, using the same accuracy parameter $\epsilon$ in Definition 1, the resulting
matrix $\tilde{A}$ has fewer non-zero elements. In [1, 2] the authors presented a sampling method that
requires at least $O(\mathrm{st}(A)\, n \ln^4 n / \epsilon^2)$ non-zero entries in $\tilde{A}$ in order to achieve the proposed
accuracy guarantee. Our result reduces the sampling complexity by a modest, yet non-trivial,
$O(\ln n)$ factor. It is harder to compare our method to the work of [5], which depends on the
quantity $\sum_{i,j=1}^{n} |A_{ij}|$. The latter quantity is, in general, upper bounded only by $n \|A\|_F$, in which case
the sampling complexity of [5] is much worse, namely $O(\mathrm{st}(A)\, n^{3/2} / \epsilon)$. However, it is worth
noting that the result of [5] is appropriate for matrices whose "energy" is focused only on a
small number of entries, as well as that their bound holds with much higher probability than
ours. Finally, while preparing this manuscript, the results of [17] were brought to our attention.
In this paper, the authors study the $\|\cdot\|_{\infty \to 2}$ and $\|\cdot\|_{\infty \to 1}$ norms in the matrix sparsification
context. The authors also present a sampling scheme for the problem of Definition 1. Their
theoretical analysis is not directly comparable to our results, since the sampling complexity
depends on the average of the ratios $A_{ij}^2 / \max_{i,j} A_{ij}^2$.

1.3 Bounding the spectral norm of random tensors 

An important contribution of our work is the technical analysis and, in particular, the proof
of a bound for the spectral norm of random tensors that is necessary in order to prove
Theorem 1. It is worth noting that all known results for the $d = 2$ case of Theorem 1 are either
combinatorial in nature (e.g., the proofs of [1, 2] are based on the result of [16], whose proof
is fundamentally combinatorial) or use simple $\epsilon$-net arguments [5]. To the best of our
understanding, neither approach can be immediately extended to the $d > 2$ case, which requires
novel tools and methods. Indeed, we are only able to prove the following theorem using the
so-called entropy-concentration tradeoff, an analysis technique that has been recently
developed by Mark Rudelson and Roman Vershynin [28, 32]. The following theorem presents a
spectral norm bound for random tensors and is fundamental in proving Theorem 1.

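Theorem 2. Let $\mathcal{A} \in \mathbb{R}^{n \times \cdots \times n}$ be an order-$d$ tensor and let $\hat{\mathcal{A}}$ be a random tensor of the same
dimensions such that $\mathbb{E}\hat{\mathcal{A}} = \mathcal{A}$. For any $q \ge 1$,

$$\left( \mathbb{E} \left\| \mathcal{A} - \hat{\mathcal{A}} \right\|_2^q \right)^{1/q} \le c_1 8^d \left( \sqrt{d \ln n} + \sqrt{q} \right) \left( \sum_{j=1}^{d} \mathbb{E} \max_{i_1, \ldots, i_{j-1}, i_{j+1}, \ldots, i_d} \left( \sum_{i_j = 1}^{n} \hat{A}_{i_1 \ldots i_{j-1} i_j i_{j+1} \ldots i_d}^2 \right)^{q/2} \right)^{1/q}.$$

Here $c_1$ is a small constant.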

An immediate corollary of the above theorem emerges by setting tensor A to zero. 

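Corollary 2. Let $\mathcal{B} \in \mathbb{R}^{n \times \cdots \times n}$ be a random order-$d$ tensor, whose entries are independent,
zero-mean, random variables. For any $q \ge 1$,

$$\left( \mathbb{E} \left\| \mathcal{B} \right\|_2^q \right)^{1/q} \le c_1 8^d \left( \sqrt{d \ln n} + \sqrt{q} \right) \left( \sum_{j=1}^{d} \mathbb{E} \max_{i_1, \ldots, i_{j-1}, i_{j+1}, \ldots, i_d} \left( \sum_{i_j = 1}^{n} B_{i_1 \ldots i_{j-1} i_j i_{j+1} \ldots i_d}^2 \right)^{q/2} \right)^{1/q}.$$

Here $c_1$ is a small constant.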

2 Preliminaries 

2.1 Notation 

We will use the notation $[n]$ to denote the set $\{1, 2, \ldots, n\}$. $c_0$, $c_1$, $c_2$, etc. will denote small
numerical constants, whose values change from one section to the next. $\mathbb{E}X$ will denote the
expectation of a random variable $X$. When $X$ is a matrix, then $\mathbb{E}X$ denotes the element-wise
expectation of each entry of $X$. Similarly, $\mathrm{Var}[X]$ denotes the variance of the random variable
$X$ and $\mathbb{P}(\mathcal{E})$ denotes the probability of event $\mathcal{E}$. Finally, $\ln x$ denotes the natural logarithm of
$x$ and $\log_2 x$ denotes the base two logarithm of $x$.

We briefly remind the reader of vector norm definitions. Given a vector $x \in \mathbb{R}^n$ the $\ell_2$
norm of $x$ is denoted by $\|x\|_2$ and is equal to the square root of the sum of the squares of the
elements of $x$. Also, the $\ell_0$ norm of the vector $x$ is equal to the number of non-zero elements
in $x$. Finally, given a Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}$ we define the Lipschitz norm of $f$ to be

$$\|f\|_L = \sup_{x, y \in \mathbb{R}^n} \frac{|f(x) - f(y)|}{\|x - y\|_2}.$$

For any $d$-mode or order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{n \times \cdots \times n}$, its Frobenius norm $\|\mathcal{A}\|_F$ is defined as the
square root of the sum of the squares of its elements. We now define tensor-vector products
as follows: let $x, y$ be vectors in $\mathbb{R}^n$. Then,

$$\mathcal{A} \times_1 x = \sum_{i=1}^{n} A_{ijk \ldots \ell}\, x_i,$$

$$\mathcal{A} \times_2 x = \sum_{j=1}^{n} A_{ijk \ldots \ell}\, x_j,$$

$$\mathcal{A} \times_3 x = \sum_{k=1}^{n} A_{ijk \ldots \ell}\, x_k, \quad \text{etc.}$$

Note that the outcome of the above operations is an order-$(d-1)$ tensor. The above definition
may be extended to handle multiple tensor-vector products, e.g.,

$$\mathcal{A} \times_1 x \times_2 y = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ijk \ldots \ell}\, x_i y_j.$$

Note that the outcome of the above operation is an order-$(d-2)$ tensor. Using this definition,
the spectral norm of a tensor is defined as

$$\|\mathcal{A}\|_2 = \sup_{x_1, \ldots, x_d \in \mathbb{R}^n} \left| \mathcal{A} \times_1 x_1 \cdots \times_d x_d \right|,$$

where all the $x_i \in \mathbb{R}^n$ are unit vectors, i.e., $\|x_i\|_2 = 1$ for all $i \in [d]$. It is worth noting that
$\mathcal{A} \times_1 x_1 \cdots \times_d x_d \in \mathbb{R}$ and also that our tensor norm definitions when restricted to matrices
(order-2 tensors) coincide with the standard definitions of matrix norms.
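
To make the tensor-vector products concrete, here is a small Python sketch for the order-3 case, assuming the tensor is stored as nested lists. The helper names `contract_first_two` and `outer3` are ours and purely illustrative. For a rank-one tensor $u \otimes v \otimes w$, contracting the first two modes with $x$ and $y$ gives the vector $(u \cdot x)(v \cdot y)\, w$, which the example checks.

```python
def contract_first_two(A, x, y):
    """Return the vector obtained by contracting mode 1 of A with x and mode 2 with y.

    A is an order-3 tensor stored as nested lists A[i][j][k] (assumed format).
    """
    n = len(A)
    return [sum(A[i][j][k] * x[i] * y[j] for i in range(n) for j in range(n))
            for k in range(n)]

def outer3(u, v, w):
    """Rank-one order-3 tensor with entries u[i] * v[j] * w[k]."""
    return [[[ui * vj * wk for wk in w] for vj in v] for ui in u]

u, v, w = [1, 2], [0, 1], [3, -1]
x, y = [1, 1], [2, 5]
result = contract_first_two(outer3(u, v, w), x, y)
# (u . x) = 3 and (v . y) = 5, so the result is 15 * w
print(result)  # -> [45, -15]
```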

2.2 Measure concentration 

We will need the following version of Bennett's inequality. 

Lemma 1. Let $X_1, X_2, \ldots, X_n$ be independent, zero-mean, random variables and assume that
$|X_i| \le 1$. For any $t \ge \frac{3}{2} \sum_{i=1}^{n} \mathrm{Var}(X_i) > 0$:

$$\mathbb{P}\left( \sum_{i=1}^{n} X_i > t \right) \le e^{-t/2}.$$

This version of Bennett's inequality can be derived from the standard one, stating:

$$\mathbb{P}\left( \sum_{i=1}^{n} X_i > t \right) \le e^{-\sigma^2 h(t / \sigma^2)}.$$

Here $\sigma^2 = \sum_{i=1}^{n} \mathrm{Var}(X_i)$ and $h(u) = (1 + u) \ln(1 + u) - u$. Lemma 1 follows using the fact
that $h(u) \ge u/2$ for $u \ge 3/2$.
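
The elementary fact used in this derivation can be checked numerically. The sketch below (our own illustration, not part of the original argument) evaluates $h(u) - u/2$ on a grid starting at $u = 3/2$. Since $h'(u) = \ln(1 + u) > 1/2$ on this range, the gap is smallest at $u = 3/2$, where it is positive. Setting $u = t / \sigma^2$ then gives $\sigma^2 h(t/\sigma^2) \ge t/2$, which is the exponent in Lemma 1.

```python
import math

def h(u):
    """The function h(u) = (1 + u) ln(1 + u) - u from Bennett's inequality."""
    return (1 + u) * math.log(1 + u) - u

# Grid over [1.5, 101.5); the spacing 0.01 is an arbitrary choice.
grid = [1.5 + 0.01 * i for i in range(10000)]
worst_gap = min(h(u) - u / 2 for u in grid)
print(worst_gap > 0)
```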

We also remind the reader of the following well-known result on measure concentration 
(see, for example, eqn. 1.4 of [23]). 

Lemma 2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a Lipschitz function and let $\|f\|_L$ be its Lipschitz norm. If
$g \in \mathbb{R}^n$ is a standard Gaussian vector (i.e., a vector whose entries are independent standard
Gaussian random variables), then for all $t > 0$

$$\mathbb{P}\left( f(g) \ge \mathbb{E} f(g) + t \sqrt{2}\, \|f\|_L \right) \le e^{-t^2}.$$



The following lemma, whose proof may be found in the Appendix, converts a probabilistic 
bound for the random variable X to an expectation bound for X q , for all q > 1, and might 
be of independent interest. 

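Lemma 3. Let $X$ be a random variable assuming non-negative values. For all $t \ge 0$ and
non-negative $a$, $b$, and $h$:

(a) If $\mathbb{P}(X \ge a + tb) \le e^{-t + h}$, then, for all $q \ge 1$,

$$\mathbb{E} X^q \le 2 (a + bh + bq)^q.$$

(b) If $\mathbb{P}(X \ge a + tb) \le e^{-t^2 + h}$, then, for all $q \ge 1$,

$$\mathbb{E} X^q \le 3 \sqrt{q} \left( a + b \sqrt{h} + b \sqrt{q/2} \right)^q.$$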



Finally, we present an $\epsilon$-net argument that we will repeatedly use. Recall from Lemma 3.18
of [22] that the cardinality of an $\epsilon$-net on the unit sphere is at most $(1 + 2/\epsilon)^n$. The following
lemma essentially generalizes the results of Lecture 6 of [31] to order-$d$ tensors.

Lemma 4. Let $N$ be an $\epsilon$-net for the unit sphere $S^{n-1}$ in $\mathbb{R}^n$. Then, the spectral norm of a
$d$-mode tensor $\mathcal{A}$ is bounded by

$$\|\mathcal{A}\|_2 \le \left( \frac{1}{1 - \epsilon} \right)^{d-1} \sup_{x_1, \ldots, x_{d-1} \in N} \left\| \mathcal{A} \times_1 x_1 \cdots \times_{d-1} x_{d-1} \right\|_2.$$

Notice that, using our notation, $\mathcal{A} \times_1 x_1 \cdots \times_{d-1} x_{d-1}$ is a vector in $\mathbb{R}^n$. The proof of the
lemma may be found in the Appendix.

3 Bounding the spectral norm of random tensors 

This section will focus on proving Theorem 2, which essentially bounds the spectral norm of
random tensors. Towards that end, we will first apply a symmetrization argument following
the lines of [18]. This argument will allow us to reduce the task at hand to bounding the
spectral norm of a Gaussian random tensor. As a result, we will develop such an inequality by
employing the so-called entropy-concentration technique, which has been developed by Mark
Rudelson and Roman Vershynin [28, 32].

For simplicity of exposition and to avoid carrying multiple indices, we will focus on proving
Theorem 2 for order-3 tensors (i.e., $d = 3$). Throughout the proof, we will carefully
comment on derivations where $d$ (the number of modes of the tensor) affects the bounds of
the intermediate results. Notice that if $d = 3$, then a tensor $\mathcal{A} \in \mathbb{R}^{n \times n \times n}$ may be expressed as

$$\mathcal{A} = \sum_{i,j,k=1}^{n} A_{ijk} \cdot e_i \otimes e_j \otimes e_k. \tag{1}$$

In the above, the vectors $e_i \in \mathbb{R}^n$ (for all $i \in [n]$) denote the standard basis for $\mathbb{R}^n$ and $\otimes$
denotes the outer product operation. Thus, for example, $e_i \otimes e_j \otimes e_k$ denotes a tensor in
$\mathbb{R}^{n \times n \times n}$ whose $(i, j, k)$-th entry is equal to one, while all other entries are equal to zero.



3.1 A Gaussian symmetrization inequality 



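The main result of this section can be summarized in Lemma 5. In words, the lemma states
that, by losing a factor of $\sqrt{2\pi}$, we can independently randomize each entry of $\hat{\mathcal{A}}$ via a Gaussian
random variable. Thus, we essentially reduce the problem of finding a bound for the spectral
norm of a tensor $\mathcal{A} - \hat{\mathcal{A}}$ to finding a bound for the spectral norm of a Gaussian random tensor.

Lemma 5. Let $\mathcal{A} \in \mathbb{R}^{n \times n \times n}$ be any order-3 tensor and let $\hat{\mathcal{A}}$ be a random tensor of the same
dimensions such that $\mathbb{E}_{\hat{\mathcal{A}}} \hat{\mathcal{A}} = \mathcal{A}$. Also let the $g_{ijk}$ be Gaussian random variables for all triples
$(i, j, k) \in [n] \times [n] \times [n]$. Then,

$$\mathbb{E}_{\hat{\mathcal{A}}} \left\| \mathcal{A} - \hat{\mathcal{A}} \right\|_2^q \le \left( \sqrt{2\pi} \right)^q \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_g \left\| \sum_{i,j,k} g_{ijk} \hat{A}_{ijk} \cdot e_i \otimes e_j \otimes e_k \right\|_2^q. \tag{2}$$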



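Proof. Let $\hat{\mathcal{A}}'$ be an independent copy of the tensor $\hat{\mathcal{A}}$. By applying a symmetrization
argument and Jensen's inequality, we get

$$\mathbb{E}_{\hat{\mathcal{A}}} \left\| \mathcal{A} - \hat{\mathcal{A}} \right\|_2^q = \mathbb{E}_{\hat{\mathcal{A}}} \left\| \hat{\mathcal{A}} - \mathbb{E}_{\hat{\mathcal{A}}} \hat{\mathcal{A}} \right\|_2^q = \mathbb{E}_{\hat{\mathcal{A}}} \left\| \hat{\mathcal{A}} - \mathbb{E}_{\hat{\mathcal{A}}'} \hat{\mathcal{A}}' \right\|_2^q \le \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\hat{\mathcal{A}}'} \left\| \hat{\mathcal{A}} - \hat{\mathcal{A}}' \right\|_2^q.$$

Note that the entries of the tensor $\hat{\mathcal{A}} - \hat{\mathcal{A}}'$ are symmetric random variables and thus their
distribution is the same as the distribution of the random variables $\epsilon_{ijk} (\hat{A}_{ijk} - \hat{A}'_{ijk})$, where
the $\epsilon_{ijk}$'s are independent, symmetric, Bernoulli random variables assuming the values $+1$ and
$-1$ with equal probability. Hence,

$$\mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\hat{\mathcal{A}}'} \left\| \hat{\mathcal{A}} - \hat{\mathcal{A}}' \right\|_2^q = \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\hat{\mathcal{A}}'} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \left( \hat{A}_{ijk} - \hat{A}'_{ijk} \right) e_i \otimes e_j \otimes e_k \right\|_2^q$$

$$\le 2^{q-1}\, \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}_{ijk}\, e_i \otimes e_j \otimes e_k \right\|_2^q + 2^{q-1}\, \mathbb{E}_{\hat{\mathcal{A}}'} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}'_{ijk}\, e_i \otimes e_j \otimes e_k \right\|_2^q.$$

The last inequality follows from the triangle inequality (subadditivity) for the tensor spectral norm and the
inequality $(x + y)^q \le 2^{q-1} (x^q + y^q)$ with $q \ge 1$. Now, since the entries of the tensors $\hat{\mathcal{A}}$ and
$\hat{\mathcal{A}}'$ have the same distribution, we get

$$\mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\hat{\mathcal{A}}'} \left\| \hat{\mathcal{A}} - \hat{\mathcal{A}}' \right\|_2^q \le 2^q\, \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}_{ijk}\, e_i \otimes e_j \otimes e_k \right\|_2^q. \tag{3}$$

We now proceed with the Gaussian symmetrization argument. Let $g_{ijk}$ for all $i$, $j$, and $k$
be independent standard Gaussian random variables. It is well-known that $\mathbb{E}|g_{ijk}| = \sqrt{2/\pi}$. Using
Jensen's inequality, we get

$$\mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}_{ijk}\, e_i \otimes e_j \otimes e_k \right\|_2^q = \left( \frac{\pi}{2} \right)^{q/2} \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\epsilon} \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}_{ijk} \left( \mathbb{E}_g |g_{ijk}| \right) \cdot e_i \otimes e_j \otimes e_k \right\|_2^q$$

$$\le \left( \frac{\pi}{2} \right)^{q/2} \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_{\epsilon} \mathbb{E}_g \left\| \sum_{i,j,k} \epsilon_{ijk} \hat{A}_{ijk} |g_{ijk}| \cdot e_i \otimes e_j \otimes e_k \right\|_2^q$$

$$= \left( \frac{\pi}{2} \right)^{q/2} \mathbb{E}_{\hat{\mathcal{A}}} \mathbb{E}_g \left\| \sum_{i,j,k} g_{ijk} \hat{A}_{ijk} \cdot e_i \otimes e_j \otimes e_k \right\|_2^q.$$

The last equality holds since $\epsilon_{ijk} |g_{ijk}|$ and $g_{ijk}$ have the same distribution. Thus, combining
the above with eqn. (3), and noting that $2^q (\pi/2)^{q/2} = (\sqrt{2\pi})^q$, we have finally obtained the Gaussian symmetrization inequality. □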
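
The two facts about Gaussians used in this proof are easy to check by simulation. The following Monte Carlo sketch (our own illustration; the seed and sample size are arbitrary choices) estimates $\mathbb{E}|g|$ for a standard Gaussian $g$ and compares it with $\sqrt{2/\pi}$, then verifies that multiplying $|g|$ by an independent random sign gives a variable with mean close to zero and variance close to one, as expected if it is again standard Gaussian.

```python
import math
import random

rng = random.Random(0)          # fixed seed, arbitrary choice
m = 200_000                     # number of samples, arbitrary choice

g = [rng.gauss(0.0, 1.0) for _ in range(m)]
mean_abs = sum(abs(v) for v in g) / m          # estimates E|g|

# eps * |g| with an independent random sign eps in {-1, +1}
sym = [rng.choice((-1.0, 1.0)) * abs(v) for v in g]
mean_sym = sum(sym) / m                        # should be near 0
var_sym = sum(v * v for v in sym) / m          # should be near 1

print(mean_abs, math.sqrt(2 / math.pi))
```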

3.2 Bounding the spectral norm of a Gaussian random tensor 

In this section we will seek a bound for the spectral norm of the tensor $\mathcal{H}$ whose entries $H_{ijk}$
are equal to $g_{ijk} \hat{A}_{ijk}$ (we are using the notation of Lemma 5). Obviously, the entries of $\mathcal{H}$ are
independent, zero-mean Gaussian random variables. We would like to estimate

$$\mathbb{E}_g \|\mathcal{H}\|_2^q = \mathbb{E}_g \sup_{x, y} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q$$

over all unit vectors $x, y \in \mathbb{R}^n$. Our first lemma computes the expectation of the quantity
$\|\mathcal{H} \times_1 x \times_2 y\|_2$ for a fixed pair of unit vectors $x$ and $y$.

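Lemma 6. Given a pair of unit vectors $x$ and $y$,

$$\mathbb{E}_g \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \le \sqrt{ \max_{i,j} \sum_{k} \hat{A}_{ijk}^2 }.$$

Proof. Let $s = \mathcal{H} \times_1 x \times_2 y \in \mathbb{R}^n$ and let $s_k = \sum_{i,j} H_{ijk} x_i y_j$ for all $k \in [n]$. Thus

$$\|s\|_2^2 = \sum_{k} \left( \sum_{i,j} H_{ijk} x_i y_j \right)^2 = \sum_{i,j,k} H_{ijk}^2 x_i^2 y_j^2 + \sum_{k} \sum_{(i,j) \ne (p,q)} H_{ijk} H_{pqk}\, x_i y_j x_p y_q.$$

Using $\mathbb{E}_g H_{ijk} = 0$ and $\mathbb{E}_g H_{ijk}^2 = \hat{A}_{ijk}^2\, \mathbb{E}_g g_{ijk}^2 = \hat{A}_{ijk}^2$ we conclude that

$$\mathbb{E}_g \|s\|_2^2 = \sum_{i,j,k} \hat{A}_{ijk}^2 x_i^2 y_j^2 = \sum_{i} x_i^2 \sum_{j} y_j^2 \sum_{k} \hat{A}_{ijk}^2 \le \max_{i,j} \sum_{k} \hat{A}_{ijk}^2.$$

The last inequality follows since $\|x\|_2 = \|y\|_2 = 1$. Using $\mathbb{E}_g \|s\|_2 \le \sqrt{\mathbb{E}_g \|s\|_2^2}$ we obtain the
claim of the lemma. □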
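
The computation in this proof can be illustrated numerically for a small order-3 tensor. The sketch below is our own illustration, with an arbitrary random tensor, dimension, and seed. It evaluates the exact second moment $\sum_{i,j,k} A_{ijk}^2 x_i^2 y_j^2$, compares it with $\max_{i,j} \sum_k A_{ijk}^2$, and estimates the expected norm by Monte Carlo.

```python
import math
import random

rng = random.Random(0)
n = 4                                           # small dimension, arbitrary choice
A = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)] for _ in range(n)]

def unit(v):
    """Scale v to unit Euclidean norm."""
    nrm = math.sqrt(sum(t * t for t in v))
    return [t / nrm for t in v]

x = unit([rng.gauss(0, 1) for _ in range(n)])
y = unit([rng.gauss(0, 1) for _ in range(n)])

# Exact second moment: E||s||^2 = sum_{ijk} A_ijk^2 x_i^2 y_j^2
second_moment = sum(A[i][j][k] ** 2 * x[i] ** 2 * y[j] ** 2
                    for i in range(n) for j in range(n) for k in range(n))
# Bound from the lemma: max over (i, j) of sum_k A_ijk^2
bound = max(sum(A[i][j][k] ** 2 for k in range(n))
            for i in range(n) for j in range(n))

def sample_norm():
    """Draw fresh Gaussians g_ijk and return ||s||_2 with s_k = sum_ij g_ijk A_ijk x_i y_j."""
    s = [sum(rng.gauss(0, 1) * A[i][j][k] * x[i] * y[j]
             for i in range(n) for j in range(n)) for k in range(n)]
    return math.sqrt(sum(t * t for t in s))

trials = 5000
mean_norm = sum(sample_norm() for _ in range(trials)) / trials
print(mean_norm, math.sqrt(second_moment), math.sqrt(bound))
```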

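The next lemma argues that $\|\mathcal{H} \times_1 x \times_2 y\|_2$ is concentrated around its mean (which we
just computed) with high probability.

Lemma 7. Given a pair of unit vectors $x$ and $y$,

$$\mathbb{P}\left( \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge \sqrt{ \max_{i,j} \sum_{k} \hat{A}_{ijk}^2 } + t \sqrt{2} \max_{i,j,k} |\hat{A}_{ijk}| \right) \le e^{-t^2}. \tag{4}$$

Proof. Consider the vector $s = \mathcal{H} \times_1 x \times_2 y \in \mathbb{R}^n$ and recall that $H_{ijk} = g_{ijk} \hat{A}_{ijk}$ to get

$$s = \sum_{i,j,k} H_{ijk} x_i y_j\, e_k = \sum_{k} \left( \sum_{i,j} H_{ijk} x_i y_j \right) e_k = \sum_{k} \left( \sum_{i,j} g_{ijk} \hat{A}_{ijk} x_i y_j \right) e_k.$$

In the above the $e_k$ for all $k \in [n]$ are the standard basis vectors for $\mathbb{R}^n$. Now observe that all
$g_{ijk} \hat{A}_{ijk} x_i y_j$ are Gaussian random variables, which implies that their sum (over all $i$ and $j$) is
also a Gaussian random variable with zero mean and variance $\sum_{i,j} \hat{A}_{ijk}^2 x_i^2 y_j^2$. Let

$$q_k = \left( \sum_{i,j} \hat{A}_{ijk}^2 x_i^2 y_j^2 \right)^{1/2} \quad \text{for all } k \in [n]$$

and rewrite the vector $s$ as the sum of weighted standard Gaussian random variables:

$$s = \sum_{k} z_k q_k e_k.$$

In the above the $z_k$'s are standard Gaussian random variables for all $k \in [n]$. Let $z$ be the
vector in $\mathbb{R}^n$ whose entries are the $z_k$'s and let

$$f(z) = \left\| \sum_{k} q_k z_k e_k \right\|_2.$$

We apply Lemma 2 to $f(z)$. First,

$$f^2(z) = \sum_{k} q_k^2 z_k^2 \le \max_k q_k^2\, \|z\|_2^2.$$

We can now compute the Lipschitz norm of $f$:

$$\|f\|_L = \max_k |q_k| = \max_k \left( \sum_{i,j} \hat{A}_{ijk}^2 x_i^2 y_j^2 \right)^{1/2} \le \max_{i,j,k} |\hat{A}_{ijk}| \left( \sum_{i,j} x_i^2 y_j^2 \right)^{1/2} \le \max_{i,j,k} |\hat{A}_{ijk}|.$$

From Lemma 2 and Lemma 6 we conclude

$$\mathbb{P}\left( \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge \sqrt{ \max_{i,j} \sum_{k} \hat{A}_{ijk}^2 } + t \max_{i,j,k} |\hat{A}_{ijk}| \right) \le e^{-t^2/2},$$

which is eqn. (4) after replacing $t$ by $t\sqrt{2}$. □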



3.2.1 An $\epsilon$-net construction: the entropy-concentration tradeoff argument

Given the measure concentration result of Lemma 7, one might be tempted to bound the
quantity $\|\mathcal{H} \times_1 x \times_2 y\|_2$ for all unit vectors $x$ and $y$ by directly constructing an $\epsilon$-net $N$ on the
unit sphere. Since the cardinality of $N$ is well-known to be upper bounded by $(1 + 2/\epsilon)^n$, it
follows that by getting an estimate for the quantity $\|\mathcal{H} \times_1 x \times_2 y\|_2$ for a pair of vectors $x$ and
$y$ in $N$ and subsequently applying the union bound combined with Lemma 4, an upper bound
for the norm of the tensor $\mathcal{H}$ may be derived. Unfortunately, this simple technique does not
yield a useful result: the failure probability of Lemma 7 is not sufficiently small in order to
permit the application of a union bound over all vectors $x$ and $y$ in $N$.

In order to overcome this obstacle, we will apply a powerful and novel argument, the so-called
entropy-concentration tradeoff, which has been recently developed by Mark Rudelson
and Roman Vershynin [28, 32]. To begin with, we express a unit vector $x \in \mathbb{R}^n$ as a sum of
two vectors $z, w \in \mathbb{R}^n$ satisfying certain bounds on the magnitude of their coordinates. Thus,
$x = z + w$, where, for all $i \in [n]$,

$$z_i = \begin{cases} x_i & \text{if } |x_i| \ge \frac{1}{\sqrt{\lambda n}}, \\ 0 & \text{otherwise,} \end{cases} \qquad w_i = \begin{cases} x_i & \text{if } |x_i| < \frac{1}{\sqrt{\lambda n}}, \\ 0 & \text{otherwise.} \end{cases}$$

In the above $\lambda \in (0, 1]$ is a small constant that will be specified later. It is easy to see that
$\|z\|_2 \le 1$, $\|w\|_2 \le 1$, and that the number of non-zero entries in $z$ (i.e., the $\ell_0$ norm of $z$) is
bounded:

$$\|z\|_0 \le \lambda n.$$

Essentially, we have "split" the entries of $x$ in two vectors: a sparse vector $z$ with a bounded
number of non-zero entries and a spread vector $w$ with entries whose magnitude is restricted.
Thus, we can now divide the unit sphere into two subspaces:

$$B_{2,0} = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le 1,\ |x_i| \ge \frac{1}{\sqrt{\lambda n}} \text{ or } x_i = 0 \right\},$$

$$B_{2,\infty} = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le 1,\ \|x\|_\infty \le \frac{1}{\sqrt{\lambda n}} \right\}.$$

Given the above two subspaces, we can apply an $\epsilon$-net argument to each subspace separately.
The advantage is that since vectors in $B_{2,0}$ only have a small number of non-zero entries, the
size of the $\epsilon$-net on $B_{2,0}$ is small. This counteracts the fact that the measure concentration
bound that we get for vectors in $B_{2,0}$ is rather weak since the vectors in this set have arbitrarily
large entries (upper bounded by one). On the other hand, vectors in $B_{2,\infty}$ have many non-zero
coefficients of bounded magnitude. As a result, the cardinality of the $\epsilon$-net on $B_{2,\infty}$ is large,
but the measure concentration bound is much tighter. Combining the contribution of the
sparse and the spread vectors results in a strong overall bound.
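
The sparse/spread decomposition described above can be written down directly. The following Python sketch is our own illustration (the function name `split_sparse_spread` is hypothetical). It splits a vector at the threshold $1/\sqrt{\lambda n}$ and checks the three properties used in the text: $x = z + w$, the sparse part has at most $\lambda n$ non-zero entries, and the spread part has entries below the threshold.

```python
import math

def split_sparse_spread(x, lam):
    """Split x into a sparse part z (large entries) and a spread part w (small entries).

    The threshold is 1 / sqrt(lam * n), where n = len(x) and lam is in (0, 1].
    """
    n = len(x)
    thr = 1.0 / math.sqrt(lam * n)
    z = [xi if abs(xi) >= thr else 0.0 for xi in x]
    w = [xi if abs(xi) < thr else 0.0 for xi in x]
    return z, w

# A unit vector with one dominant coordinate (example values are arbitrary).
n, lam = 16, 0.25
small = math.sqrt((1 - 0.9 ** 2) / (n - 1))
x = [0.9] + [small] * (n - 1)
z, w = split_sparse_spread(x, lam)
nonzeros_in_z = sum(1 for zi in z if zi != 0.0)
print(nonzeros_in_z, max(abs(wi) for wi in w))
```

Because every non-zero entry of $z$ has squared magnitude at least $1/(\lambda n)$ and $\|x\|_2 \le 1$, there can be at most $\lambda n$ such entries, which is the counting argument behind the bound on the $\ell_0$ norm of $z$.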

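We conclude the section by noting that the above two subspaces together span the whole
unit sphere $S^{n-1}$ in $\mathbb{R}^n$. Using the inequality $(\mathbb{E}(x + y)^q)^{1/q} \le (\mathbb{E}x^q)^{1/q} + (\mathbb{E}y^q)^{1/q}$ we obtain

$$\left( \mathbb{E} \sup_{x, y \in S^{n-1}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \right)^{1/q} \le \left( \mathbb{E} \sup_{x, y \in B_{2,0}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \right)^{1/q} \tag{5}$$

$$+ \left( \mathbb{E} \sup_{x, y \in B_{2,\infty}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \right)^{1/q} \tag{6}$$

$$+ \left( \mathbb{E} \sup_{x \in B_{2,0},\, y \in B_{2,\infty}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \right)^{1/q} \tag{7}$$

$$+ \left( \mathbb{E} \sup_{x \in B_{2,\infty},\, y \in B_{2,0}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \right)^{1/q}. \tag{8}$$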

3.2.2 Controlling sparse vectors 

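We now prove the following lemma bounding the contribution of the sparse vectors (term (5))
in our $\epsilon$-net construction.

Lemma 8. Consider a $d$-mode tensor $\hat{\mathcal{A}}$ and let $\mathcal{H}$ be the $d$-mode tensor after the Gaussian
symmetrization argument as defined in Section 3.2. Let $\alpha$ and $\beta$ be

$$\alpha^2 = \max\left\{ \max_{i,j} \sum_{k=1}^{n} \hat{A}_{ijk}^2,\ \max_{i,k} \sum_{j=1}^{n} \hat{A}_{ijk}^2,\ \max_{j,k} \sum_{i=1}^{n} \hat{A}_{ijk}^2 \right\}, \tag{9}$$

$$\beta = \max_{i,j,k} |\hat{A}_{ijk}|. \tag{10}$$

For all $q \ge 1$,

$$\mathbb{E} \sup_{x, y \in B_{2,0}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \le 3 \sqrt{q} \left[ 2^{d-1} \left( \alpha + \beta \sqrt{2 d \lambda n \ln(5e/\lambda)} + \beta \sqrt{q} \right) \right]^q. \tag{11}$$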

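Proof. Let $K = \lambda n$ and let $B_{2,0,K}$ be the $K$-dimensional subspace defined by

$$B_{2,0,K} = \{ x \in \mathbb{R}^K : \|x\|_2 \le 1 \}.$$

Then, the subspace $B_{2,0}$ corresponding to vectors with at most $K$ non-zero entries can be
expressed as a union of subspaces of dimension $K$, i.e., $B_{2,0} = \bigcup B_{2,0,K}$. A simple counting
argument now indicates that there are at most $\binom{n}{K} \le \left( \frac{en}{K} \right)^K$ such subspaces.

We now apply the $\epsilon$-net technique to each of the subspaces $B_{2,0,K}$ whose union is the
subspace $B_{2,0}$. First, let us define $N_{B_{2,0,K}}$ to be the $1/2$-net of a subspace $B_{2,0,K}$. Lemma 3.18
of [22] bounds the cardinality of $N_{B_{2,0,K}}$ by $5^K$. Applying Lemma 4 with $\epsilon = 1/2$ we get

$$\sup_{x, y \in B_{2,0,K}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \le 2^{d-1} \sup_{x, y \in N_{B_{2,0,K}}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2.$$

Combining the above with Lemma 7, which bounds the term $\mathcal{H} \times_1 x \times_2 y$ for a specific pair of
unit vectors $x$ and $y$, and taking the union bound over all $x, y \in N_{B_{2,0,K}}$, we get

$$\mathbb{P}\left( \sup_{x, y \in B_{2,0,K}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge 2^{d-1} \left( \alpha + t \sqrt{2}\, \beta \right) \right) \le \left( 5^K \right)^{d-1} e^{-t^2}.$$

In the above $\alpha$ and $\beta$ are defined in eqns. (9) and (10) respectively. We now explain the
$(5^K)^{d-1}$ term in the failure probability. In general, the product $\mathcal{H} \times_1 x \times_2 y \cdots$ should be
evaluated on $d - 1$ vectors $x, y, \ldots$. Recall that the $1/2$-net $N_{B_{2,0,K}}$ contains $5^K$ vectors and
thus there is a total of $(5^K)^{d-1}$ possible vector combinations. A standard union bound now
justifies the above formula. Finally, taking the union bound over all possible subspaces $B_{2,0,K}$
that comprise the subspace $B_{2,0}$ and using $K = \lambda n$ yields

$$\mathbb{P}\left( \sup_{x, y \in B_{2,0}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge 2^{d-1} \left( \alpha + t \sqrt{2}\, \beta \right) \right) \le \left( \left( \frac{e}{\lambda} \right)^{\lambda n} \right)^{d-1} \left( 5^{\lambda n} \right)^{d-1} e^{-t^2} = \left( \frac{5e}{\lambda} \right)^{\lambda n (d-1)} e^{-t^2} \le \left( \frac{5e}{\lambda} \right)^{\lambda n d} e^{-t^2}. \tag{12}$$

In the above, we again accounted for all $d - 1$ modes of the tensor and also used $d - 1 \le d$. Using
eqn. (12) and applying Lemma 3, part (b), with $a = 2^{d-1} \alpha$, $b = 2^{d-1} \beta \sqrt{2}$, and $h = d \lambda n \ln(5e/\lambda)$
we get

$$\mathbb{E} \sup_{x, y \in B_{2,0}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \le 3 \sqrt{q} \left( 2^{d-1} \left( \alpha + \beta \sqrt{2 d \lambda n \ln(5e/\lambda)} + \beta \sqrt{q} \right) \right)^q.$$

□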

3.2.3 Controlling spread vectors 

We now prove the following lemma bounding the contribution of the spread vectors (term (6))
in our $\epsilon$-net construction.

Lemma 9. Consider a $d$-mode tensor $\hat{\mathcal{A}}$ and let $\mathcal{H}$ be the $d$-mode tensor after the Gaussian
symmetrization argument as defined in Section 3.2. Let $\alpha$ be defined as in eqn. (9). For all
$q \ge 1$,

$$\mathbb{E} \sup_{x, y \in B_{2,\infty}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2^q \le 3 \sqrt{q} \left( 2^{2(d-1)} \left( \alpha + 4 \alpha \sqrt{d \ln(e/\lambda)} + 0.5\, \alpha \sqrt{\frac{q}{\lambda n}} \right) \right)^q, \tag{13}$$

assuming that $16 \lambda n \ge \ln 2$.

It is worth noting that the particular choice of the lower bound for $\lambda n$ is an artifact of the
analysis and that we could choose smaller values for $\lambda n$ by introducing a constant factor loss
in the above inequality.

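Proof. Our strategy is similar to the one used in Lemma 8. However, in this case, the construction
of the net for the subspace $B_{2,\infty}$ is considerably more involved. Recall the definition of

$$B_{2,\infty} = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le 1,\ \|x\|_\infty \le \frac{1}{\sqrt{\lambda n}} \right\}.$$

We now define the following two sets of vectors $N_1$ and $N_2$:

$$N_1 = \left\{ z \in B_{2,\infty} : \text{for all } i \in [n],\ z_i = \pm \frac{1}{2\sqrt{\lambda n}} \text{ or } z_i = 0 \right\},$$

$$N_2 = \left\{ z \in B_{2,\infty} : \text{for all } i \in [n],\ z_i = \pm \frac{1}{4\sqrt{\lambda n}} \text{ or } z_i = 0 \right\}.$$

Our net for $B_{2,\infty}$ will be the set

$$N_{B_{2,\infty}} = \left\{ z \in B_{2,\infty} : \text{for all } i \in [n],\ z_i = \pm \frac{1}{2\sqrt{\lambda n}} \text{ or } z_i = \pm \frac{1}{4\sqrt{\lambda n}} \text{ or } z_i = 0 \right\}.$$

Our first lemma argues that $N_{B_{2,\infty}}$ is indeed a net for $B_{2,\infty}$.

Lemma 10. For all $x \in B_{2,\infty}$ there exists a vector $z \in N_{B_{2,\infty}}$ such that

$$\|x - z\|_\infty \le \frac{1}{2\sqrt{\lambda n}}.$$

Proof. Consider a vector $x \in B_{2,\infty}$ with coordinates $x_i$ for all $i \in [n]$. If $\frac{1}{2\sqrt{\lambda n}} \le |x_i| \le \frac{1}{\sqrt{\lambda n}}$
then we set $z_i = \mathrm{sign}(x_i) \frac{1}{2\sqrt{\lambda n}}$. Similarly, if $\frac{1}{4\sqrt{\lambda n}} \le |x_i| < \frac{1}{2\sqrt{\lambda n}}$ then we set
$z_i = \mathrm{sign}(x_i) \frac{1}{4\sqrt{\lambda n}}$. Finally, if $|x_i| < \frac{1}{4\sqrt{\lambda n}}$, then we set $z_i = 0$. This choice of $z$ is clearly in
$N_{B_{2,\infty}}$ and implies that for all $i \in [n]$,

$$|x_i - z_i| \le \frac{1}{2\sqrt{\lambda n}},$$

which concludes the lemma. □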
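
The rounding used in the proof of Lemma 10 can be sketched as follows. This is our reading of the construction, stated as an assumption: each coordinate of $x$ is rounded towards zero to the nearest of the three magnitudes $0$, $\frac{1}{4\sqrt{\lambda n}}$, $\frac{1}{2\sqrt{\lambda n}}$. The function name `round_to_net` and the example vector are ours.

```python
import math

def round_to_net(x, lam):
    """Round each coordinate of x (with |x_i| <= 1/sqrt(lam*n)) to a net point.

    Magnitudes are rounded down to one of 0, a/4, a/2, where a = 1/sqrt(lam*n),
    and the sign of x_i is kept.
    """
    n = len(x)
    a = 1.0 / math.sqrt(lam * n)
    z = []
    for xi in x:
        if abs(xi) >= a / 2:
            z.append(math.copysign(a / 2, xi))
        elif abs(xi) >= a / 4:
            z.append(math.copysign(a / 4, xi))
        else:
            z.append(0.0)
    return z

n, lam = 16, 0.25                     # gives a = 1/sqrt(4) = 0.5
x = [0.5, -0.3, 0.2, 0.1, -0.05] + [0.0] * (n - 5)
z = round_to_net(x, lam)
max_err = max(abs(xi - zi) for xi, zi in zip(x, z))
print(z[:5], max_err)
```

Since the rounding never increases the magnitude of a coordinate, $\|z\|_2 \le \|x\|_2 \le 1$, so the rounded vector stays inside $B_{2,\infty}$.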

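Given our definitions for $N_1$, $N_2$ and $N_{B_{2,\infty}}$, it immediately follows that any vector in $N_{B_{2,\infty}}$
can be expressed as a sum of two vectors, one in $N_1$ and one in $N_2$. Combining the above
lemma with Lemma 4 we get

$$\sup_{x, y \in B_{2,\infty}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \le 2^{d-1} \sup_{x, y \in N_{B_{2,\infty}}} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2$$

$$\le 2^{d-1} \left( \sup_{x, y \in N_1} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 + \sup_{x, y \in N_2} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \right) \tag{14}$$

$$+\ 2^{d-1} \left( \sup_{x \in N_1,\, y \in N_2} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 + \sup_{x \in N_2,\, y \in N_1} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \right). \tag{15}$$

It is important to note that the last inequality would have a total of $2^{d-1}$ terms (as opposed
to four) for the general case of order-$d$ tensors. Our final bound accounts for all these terms
and we will return to this point later in this section. Our next lemma bounds the number of
vectors in $N_1$ and $N_2$.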

Lemma 11. Given our definitions for $N_1$ and $N_2$, $|N_1| \le e^{4 \lambda n \ln(e/\lambda)}$ and $|N_2| \le e^{16 \lambda n \ln(e/\lambda)}$.

Proof. We focus on $N_1$; the proof for $N_2$ is similar. For all $z \in N_1$, the number of non-zero
entries in $z$ is at most $4 \lambda n$, since $\|z\|_2 \le 1$. Let $\gamma = 4 \lambda n$ and notice that the number of
non-zero entries in $z$ (the "sparsity" of $z$, denoted by $s$) can range from 1 up to $\gamma$. For each
value of the sparsity parameter $s$, there exist $2^s \binom{n}{s}$ choices for the non-zero coordinates ($\binom{n}{s}$
positions times $2^s$ sign choices). Thus, the cardinality of $N_1$ is bounded by

$$|N_1| \le \sum_{s=1}^{\gamma} 2^s \binom{n}{s} \le \left( \frac{2en}{\gamma} \right)^{\gamma} = e^{\gamma \ln(2en/\gamma)} = e^{4 \lambda n \ln(e / 2\lambda)} \le e^{4 \lambda n \ln(e/\lambda)}.$$

□
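
The counting bound in the proof of Lemma 11 can be checked for concrete values. The sketch below is our own illustration (the function name and the values $n = 64$, $\lambda = 1/16$ are arbitrary). It computes $\sum_{s=1}^{\gamma} 2^s \binom{n}{s}$ exactly with $\gamma = 4\lambda n$ and compares it with the intermediate quantity $(2en/\gamma)^{\gamma}$ and the final bound $e^{4\lambda n \ln(e/\lambda)}$.

```python
import math

def n1_count_and_bounds(n, lam):
    """Return (exact count, intermediate bound, final bound) for the set N_1.

    gamma = 4 * lam * n is the largest possible number of non-zero entries.
    """
    gamma = int(4 * lam * n)
    count = sum(2 ** s * math.comb(n, s) for s in range(1, gamma + 1))
    intermediate = (2 * math.e * n / gamma) ** gamma
    final = math.exp(4 * lam * n * math.log(math.e / lam))
    return count, intermediate, final

count, intermediate, final = n1_count_and_bounds(64, 1 / 16)
print(count <= intermediate <= final)
```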

We now proceed to estimate the quantity $\|\mathcal{H} \times_1 x \times_2 y\|_2$ over all vector combinations that
appear in eqns. (14) and (15).

Lemma 12. Using our notation,

$$\mathbb{P}\left( \sup_{x, y \in N_1} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge \alpha + t \sqrt{2}\, \frac{1}{2\sqrt{\lambda n}}\, \alpha \right) \le e^{-t^2 + 16 d \lambda n \ln(e/\lambda)}.$$

The same bound holds if the sup is evaluated over all pairs of vectors $x, y \in N_2$ or over all
pairs of vectors $x \in N_1, y \in N_2$ or over all pairs of vectors $x \in N_2, y \in N_1$.

Proof. We only prove the bound for $x, y \in N_1$, since the other three bounds can be proven
in the same way. (It is worth noting here that slightly different, and sometimes tighter, at
least up to constants, bounds emerge for the other three cases, but presenting them would
further complicate our presentation and thus they are omitted.) We can exactly follow the proof
of Lemma 7. The only difference is that the bounds for the Lipschitz norm of the function $f$
must be modified. More specifically, in this case,

$$\|f\|_L^2 = \max_k q_k^2 = \max_k \sum_{i,j} \hat{A}_{ijk}^2 x_i^2 y_j^2 = \max_k \left( \sum_{i} x_i^2 \sum_{j} y_j^2 \hat{A}_{ijk}^2 \right) \le \frac{1}{4 \lambda n}\, \alpha^2.$$

In the above we used the fact that $\|y\|_2 \le 1$ and $\|x\|_\infty = \frac{1}{2\sqrt{\lambda n}}$. We continue by replicating
the proof of Lemma 7 in order to get (recall the definition of $\alpha$ from eqn. (9)):

$$\mathbb{P}\left( \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge \alpha + t \sqrt{2}\, \frac{1}{2\sqrt{\lambda n}}\, \alpha \right) \le e^{-t^2}. \tag{16}$$

Taking the union bound over all possible combinations of vectors $x$ and $y$ in $N_1$ and using
Lemma 11 we get

$$\mathbb{P}\left( \sup_{x, y \in N_1} \left\| \mathcal{H} \times_1 x \times_2 y \right\|_2 \ge \alpha + t \sqrt{2}\, \frac{1}{2\sqrt{\lambda n}}\, \alpha \right) \le e^{-t^2 + (d-1) 4 \lambda n \ln(e/\lambda)},$$

where the $(d - 1)$ factor appears in the exponential because of a union bound over all $d - 1$
vectors that could appear in the product $\mathcal{H} \times_1 x \times_2 y \cdots$. Proving the lemma is now trivial
using $4(d - 1) \le 16 d$; this bound is chosen in order to hold for all four combinations of vectors
$x$ and $y$ in $N_1$ and/or $N_2$. □
Using the bounds of Lemma 12 we get

$$P\left(\sum_{(x,y)\in(N_1,N_2)}\sup_{x,y}\|\mathcal{H}\times_1 x\times_2 y\|_2 \ge 2^{d-1}\left(\alpha + t\sqrt{2}\,\frac{1}{2\sqrt{\lambda n}}\,\alpha\right)\right) \le 2^{d-1}e^{-t^2+16d\lambda n\ln(e/\lambda)}.$$

In the above, the sum is over all possible combinations of vectors $x,y,\ldots$ in $N_1$ and/or $N_2$. In general, there are at most $2^{d-1}$ such combinations (thus, when $d=3$ there exist only four combinations). We are now ready to apply Lemma 3 with $a = 2^{d-1}\alpha$, $b = \frac{2^{d-1}}{\sqrt{2\lambda n}}\alpha$, and $h = 16d\lambda n\ln(e/\lambda) + (d-1)\ln 2$ to get

$$E\left(\sum_{(x,y)\in(N_1,N_2)}\sup_{x,y}\|\mathcal{H}\times_1 x\times_2 y\|_2\right)^q \le 3\sqrt{q}\left(a + b\sqrt{h} + b\sqrt{q/2}\right)^q.$$

Combining with eqns. (14) and (15), we derive

$$E\sup_{x,y\in B_{2,\infty}}\|\mathcal{H}\times_1 x\times_2 y\|_2^q \le 3\sqrt{q}\left(2^{d-1}\left(a + b\sqrt{h} + b\sqrt{q/2}\right)\right)^q.$$

We now proceed by substituting the values of $a$, $b$, and $h$ in the above equation. To conclude the proof of Lemma 9 we note that under our assumption on $\lambda n$ and using the fact that $\lambda$ is between zero and one, we get $(d-1)\ln 2 \le 16d\lambda n\ln(e/\lambda)$. □

3.2.4 Controlling combinations of sparse and spread vectors 

We now prove the following lemma bounding the contribution of combinations of sparse and spread vectors (terms (7) and (8)) in our $\epsilon$-net construction.

Lemma 13. Consider a $d$-mode tensor $\mathcal{A}$ and let $\mathcal{H}$ be the $d$-mode tensor after the Gaussian symmetrization argument as defined in Section 3.2. Let $\alpha$ be defined as before. For all $q\ge 1$,

$$E\sup_{x\in B_{2,\infty},\,y\in B_{2,0}}\|\mathcal{H}\times_1 x\times_2 y\|_2^q \le 3\sqrt{q}\left(2^{2d-3}\left(\alpha + 4\alpha\sqrt{d\ln(5e/\lambda)} + 0.5\,\alpha\sqrt{\frac{q}{\lambda n}}\right)\right)^q, \qquad (17)$$

assuming that $15\lambda n \ge (\ln 2)/(1+\ln 5)$.

It is worth noting that the particular choice of the lower bound for $\lambda n$ is an artifact of the analysis and that we could choose smaller values for $\lambda n$ by introducing a constant factor loss in the above inequality.

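Proof. Let $x\in B_{2,\infty}$ and $y\in B_{2,0}$. In Sections 3.2.2 and 3.2.3 we defined $N_{B_{2,0}}$ (a 1/2-net for $B_{2,0}$) and $N_{B_{2,\infty}}$ (a 1/2-net for $B_{2,\infty}$). Recall that for $K=\lambda n$, $B_{2,0}$ was the union of $\binom{n}{K}$ $K$-dimensional subspaces $B_{2,0,K}$. Consequently, the 1/2-net $N_{B_{2,0}}$ is the union of the 1/2-nets $N_{B_{2,0,K}}$ (each $N_{B_{2,0,K}}$ is the 1/2-net of $B_{2,0,K}$). Recall from Section 3.2.2 that the cardinality of $N_{B_{2,0}}$ is bounded by

$$|N_{B_{2,0}}| \le \binom{n}{K}5^{K} \le \left(\frac{5en}{K}\right)^{K} = e^{\lambda n\ln(5e/\lambda)}. \qquad (18)$$

We apply the $\epsilon$-net lemma (whose proof appears in the Appendix) to get

$$\sup_{x\in B_{2,\infty},\,y\in B_{2,0}}\|\mathcal{H}\times_1 x\times_2 y\|_2 \le 2^{d-1}\sup_{x\in N_{B_{2,\infty}},\,y\in N_{B_{2,0}}}\|\mathcal{H}\times_1 x\times_2 y\|_2. \qquad (19)$$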

It is now important to note that for a general $d$-mode tensor $\mathcal{H}$ the above product $\mathcal{H}\times_1 x\times_2 y\times_3\cdots$ would be computed over $d-1$ vectors, with at least one of those vectors (w.l.o.g. $x$) in $N_{B_{2,\infty}}$ and at least one of those vectors (w.l.o.g. $y$) in $N_{B_{2,0}}$. Each of the remaining $(d-3)$ vectors could belong either to $N_{B_{2,0}}$ or to $N_{B_{2,\infty}}$. In order to proceed with our analysis, we will need to further express the vectors belonging to $N_{B_{2,\infty}}$ as a sum of two vectors belonging to $N_1$ and $N_2$ respectively. (The reader might want to recall our definitions for $N_1$ and $N_2$ from Section 3.2.3.) Note that since at most $(d-2)$ vectors in the product $\mathcal{H}\times_1 x\times_2 y\times_3\cdots$ might belong to $N_{B_{2,\infty}}$, this product can be expressed as a sum of (at most) $2^{d-2}$ terms as follows:

$$\mathcal{H}\times_1 x\times_2 y\cdots = \mathcal{H}\times_1 x_1\times_2 y\cdots + \mathcal{H}\times_1 x_2\times_2 y\cdots,$$

where $x\in N_{B_{2,\infty}}$, $y\in N_{B_{2,0}}$, $x_1\in N_1$, and $x_2\in N_2$. We now need a bound for the $\ell_2$ norm for each of the $2^{d-2}$ such terms. Fortunately, this bound has essentially already been derived in Section 3.2.3. We start by noting that the bound of eqn. (16) holds when at least one of the vectors in the product $\mathcal{H}\times_1 x\times_2 y\cdots$ belongs to $N_1$ or $N_2$. Thus, the same bound holds for each of the terms that we seek to bound, since each of the terms has at least one vector in $N_1$ or $N_2$, namely

$$P\left(\|\mathcal{H}\times_1 x_i\times_2 y\cdots\|_2 \ge \alpha + t\sqrt{2}\,\frac{1}{2\sqrt{\lambda n}}\,\alpha\right) \le e^{-t^2}$$

holds for $i=1$ or $i=2$. We now proceed to upper bound the quantity

$$\sup\|\mathcal{H}\times_1 x_i\times_2 y\cdots\|_2, \qquad (20)$$

where $x_i$ is either in $N_1$ or $N_2$, $y\in N_{B_{2,0}}$, and the remaining $(d-3)$ vectors are either in $N_1$ or in $N_2$ or in $N_{B_{2,0}}$. We apply a union bound by noting that from Lemma 11 the cardinalities of $N_1$ and $N_2$ are upper bounded by $e^{16\lambda n\ln(e/\lambda)}$. Combining with eqn. (18) we get that the total number of possible vectors over which the sup of eqn. (20) is computed does not exceed

$$\left(e^{16\lambda n\ln(e/\lambda)}\cdot e^{\lambda n\ln(5e/\lambda)}\right)^{d} \le e^{17d\lambda n\ln(5e/\lambda)}.$$

Note that it is not necessary to get a particularly tight bound on the total number of vectors, hence the rather loose $d$ factors that appear in the exponents. A more careful counting argument might result in slightly better constants, but for the sake of clarity we choose to proceed with the above bound. We can now use a standard union bound to get

$$P\left(\sup\|\mathcal{H}\times_1 x_i\times_2 y\cdots\|_2 \ge \alpha + t\sqrt{2}\,\frac{1}{2\sqrt{\lambda n}}\,\alpha\right) \le e^{-t^2+17d\lambda n\ln(5e/\lambda)}.$$

Recall that the quantity $\sup_{x\in N_{B_{2,\infty}},\,y\in N_{B_{2,0}}}\|\mathcal{H}\times_1 x\times_2 y\cdots\|_2$ of eqn. (19) could have up to $2^{d-2}$ terms of the above form and thus

$$P\left(\sup_{x\in N_{B_{2,\infty}},\,y\in N_{B_{2,0}}}\|\mathcal{H}\times_1 x\times_2 y\cdots\|_2 \ge 2^{d-2}\left(\alpha + t\sqrt{2}\,\frac{1}{2\sqrt{\lambda n}}\,\alpha\right)\right) \le 2^{d-2}e^{-t^2+17d\lambda n\ln(5e/\lambda)}.$$
We are now ready to apply Lemma 3 with $a = 2^{d-2}\alpha$, $b = \frac{2^{d-2}}{\sqrt{2\lambda n}}\alpha$, and $h = 17d\lambda n\ln(5e/\lambda) + (d-2)\ln 2$ to get

$$E\sup_{x\in N_{B_{2,\infty}},\,y\in N_{B_{2,0}}}\|\mathcal{H}\times_1 x\times_2 y\cdots\|_2^q \le 3\sqrt{q}\left(a + b\sqrt{h} + b\sqrt{q/2}\right)^q.$$

Combining with eqn. (19) we get

$$E\sup_{x\in B_{2,\infty},\,y\in B_{2,0}}\|\mathcal{H}\times_1 x\times_2 y\cdots\|_2^q \le 3\sqrt{q}\left(2^{d-1}\left(a + b\sqrt{h} + b\sqrt{q/2}\right)\right)^q.$$

We now proceed by substituting the values of $a$, $b$, and $h$ in the above equation. To conclude the proof of the lemma we note that under our assumption on $\lambda n$ and using the fact that $\lambda$ is between zero and one, we get $(d-2)\ln 2 \le 15d\lambda n\ln(5e/\lambda)$. □



3.2.5 Concluding the proof of Theorem 2

Given the results of the preceding sections we can now conclude the proof of Theorem 2. We combine Lemmas 7, 9, and 13 in order to bound terms (5), (6), (7), and (8). First,

$$\left(E\|\mathcal{H}\|_2^q\right)^{1/q} \le (3\sqrt{q})^{1/q}\,2^{d-1}\left(\alpha + \beta\sqrt{2d\lambda n\ln(5e/\lambda)} + \beta\sqrt{q/2}\right) + \left(2^{d-1}-1\right)\times(3\sqrt{q})^{1/q}\,2^{2(d-1)}\left(\alpha + 4\alpha\sqrt{d\ln(e/\lambda)} + 0.5\,\alpha\sqrt{\frac{q}{\lambda n}}\right).$$

In the above bound we leveraged the observation that the right-hand side of the bound in Lemma 9 is also an upper bound for the right-hand side of the bound in Lemma 13 for all $\lambda\le 1$. It is also crucial to note that the constant $2^{d-1}-1$ that appears in the second term of the above inequality emerges since for general order-$d$ tensors we would have to account for a total of $2^{d-1}$ terms in the last inequality of Section 3.2.1. Clearly, for order-3 tensors, this inequality has a total of four terms. We now set

$$\lambda = \frac{e}{n}$$

(with $e = 2.718\ldots$) and note that our choice satisfies the constraints on $\lambda$ in both Lemmas 9 and 13. Simple algebraic manipulations and the fact that $\beta\le\alpha$ (recall the definitions of $\alpha$ and $\beta$) allow us to conclude that

$$\left(E\|\mathcal{H}\|_2^q\right)^{1/q} \le c_1 2^{3(d-1)}\alpha\left(\sqrt{d\ln n} + \sqrt{q}\right), \qquad (21)$$

where $c_1$ is a small numerical constant. We now remind the reader that the entries $\mathcal{H}_{ijk}$ of the tensor $\mathcal{H}$ are equal to $g_{ijk}\hat{\mathcal{A}}_{ijk}$, where the $g_{ijk}$'s are standard Gaussian random variables. Thus,

$$E_g\|\mathcal{H}\|_2^q = E_g\left\|\sum_{i,j,k} g_{ijk}\hat{\mathcal{A}}_{ijk}\; e_i\otimes e_j\otimes e_k\right\|_2^q.$$
i,j,k 
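Substituting eqn. (21) into eqn. (2) yields

$$\left(E_{\hat{\mathcal{A}}}\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2^q\right)^{1/q} \le c_1 2^{3(d-1)}\left(\sqrt{d\ln n}+\sqrt{q}\right)\left(E_{\hat{\mathcal{A}}}\,\alpha^q\right)^{1/q}. \qquad (22)$$

Finally, we rewrite $E_{\hat{\mathcal{A}}}\,\alpha^q$ as

$$E_{\hat{\mathcal{A}}}\,\alpha^q = E_{\hat{\mathcal{A}}}\max\left\{\max_{i,j}\left(\sum_{k=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2},\ \max_{i,k}\left(\sum_{j=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2},\ \max_{j,k}\left(\sum_{i=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2}\right\}$$

$$\le E_{\hat{\mathcal{A}}}\max_{i,j}\left(\sum_{k=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2} + E_{\hat{\mathcal{A}}}\max_{i,k}\left(\sum_{j=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2} + E_{\hat{\mathcal{A}}}\max_{j,k}\left(\sum_{i=1}^n\hat{\mathcal{A}}_{ijk}^2\right)^{q/2}.$$

More generally, for any order-$d$ tensor, we get

$$E_{\hat{\mathcal{A}}}\,\alpha^q \le \sum_{j=1}^{d} E_{\hat{\mathcal{A}}}\max_{i_1,\ldots,i_{j-1},i_{j+1},\ldots,i_d}\left(\sum_{i_j=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q/2}.$$

Combining the above inequality and eqn. (22) concludes the proof of Theorem 2.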
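To make the last step concrete for order-3 tensors, the following sketch computes the three maxima of squared fiber norms and checks that $\alpha^q$ is at most the sum of the three $q/2$-th powers. It assumes that $\alpha^2$ is the largest squared fiber norm over the three modes, which is how the rewriting of $E\,\alpha^q$ above reads; the function names and the nested-list tensor layout are our own illustrative choices.

```python
def max_fiber_sq_norms(T):
    """For a 3-mode tensor T[i][j][k], return the three maxima of squared fiber norms."""
    n1, n2, n3 = len(T), len(T[0]), len(T[0][0])
    # Fibers along k (fix i, j), along j (fix i, k), and along i (fix j, k).
    over_k = max(sum(T[i][j][k] ** 2 for k in range(n3)) for i in range(n1) for j in range(n2))
    over_j = max(sum(T[i][j][k] ** 2 for j in range(n2)) for i in range(n1) for k in range(n3))
    over_i = max(sum(T[i][j][k] ** 2 for i in range(n1)) for j in range(n2) for k in range(n3))
    return over_k, over_j, over_i


def alpha_power(T, q):
    """alpha^q, assuming alpha^2 is the largest squared fiber norm over all three modes."""
    return max(max_fiber_sq_norms(T)) ** (q / 2)


def sum_of_mode_terms(T, q):
    """Right-hand side of the bound: the sum of the three per-mode maxima raised to q/2."""
    return sum(m ** (q / 2) for m in max_fiber_sq_norms(T))
```

Since a maximum of non-negative numbers never exceeds their sum, `alpha_power(T, q) <= sum_of_mode_terms(T, q)` holds for every input, which is exactly the inequality used in the proof.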

4 Proving Theorem 1

The main idea underlying our proof is the application of a divide-and-conquer-type strategy in order to decompose the tensor $\mathcal{A}-\hat{\mathcal{A}}$ as a sum of tensors whose entries are bounded. Then, we will apply Theorem 2 and Corollary 2 in order to estimate the spectral norm of each tensor in the summand independently.

To formally present our analysis, let $\mathcal{A}^{[1]}\in\mathbb{R}^{n\times\cdots\times n}$ be a tensor containing all entries $\mathcal{A}_{i_1\ldots i_d}$ of $\mathcal{A}$ that satisfy $\mathcal{A}_{i_1\ldots i_d}^2 \ge 2^{-1}\frac{\|\mathcal{A}\|_F^2}{s}$; the remaining entries of $\mathcal{A}^{[1]}$ are set to zero. Similarly, we let $\mathcal{A}^{[k]}\in\mathbb{R}^{n\times\cdots\times n}$ (for all $k>1$) be tensors that contain all entries $\mathcal{A}_{i_1\ldots i_d}$ of $\mathcal{A}$ that satisfy

$$\mathcal{A}_{i_1\ldots i_d}^2 \in \left[2^{-k}\frac{\|\mathcal{A}\|_F^2}{s},\; 2^{-k+1}\frac{\|\mathcal{A}\|_F^2}{s}\right);$$

the remaining entries of $\mathcal{A}^{[k]}$ are set to zero. Finally, the tensors $\hat{\mathcal{A}}^{[k]}$ (for all $k=1,2,\ldots$) contain the (rescaled) entries of the corresponding tensor $\mathcal{A}^{[k]}$ that were selected after applying the sparsification procedure of Algorithm 1 to $\mathcal{A}$. Given these definitions,

$$\mathcal{A} = \sum_{k=1}^{\infty}\mathcal{A}^{[k]} \qquad\text{and}\qquad \hat{\mathcal{A}} = \sum_{k=1}^{\infty}\hat{\mathcal{A}}^{[k]}.$$
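The dyadic decomposition above can be sketched in a few lines. The code below assigns each nonzero entry to its band index $k$ from the ratio $p = s\,a^2/\|\mathcal{A}\|_F^2$; the helper names `band_index` and `split_into_bands`, and the choice to work on a flattened list of entries, are assumptions made for illustration only.

```python
import math


def band_index(p: float):
    """Band k of an entry with p = s * a^2 / ||A||_F^2.

    k = 1 when p >= 1/2; otherwise k is the integer with 2^-k <= p < 2^-(k-1).
    Zero entries belong to no band, so None is returned when p == 0.
    """
    if p <= 0.0:
        return None
    # math.frexp writes p = m * 2**e with 0.5 <= m < 1, i.e. 2**(e-1) <= p < 2**e.
    _, e = math.frexp(p)
    return max(1, 1 - e)


def split_into_bands(entries, s):
    """Group the nonzero entries of a flattened, not identically zero tensor by band index."""
    frob_sq = sum(a * a for a in entries)
    bands = {}
    for a in entries:
        k = band_index(s * a * a / frob_sq)
        if k is not None:
            bands.setdefault(k, []).append(a)
    return bands
```

Every nonzero entry lands in exactly one band, which mirrors the identity $\mathcal{A} = \sum_k \mathcal{A}^{[k]}$; using `math.frexp` avoids rounding issues at the band boundaries that a floating-point logarithm could introduce.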



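Let $\ell = \left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$. Then,

$$\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2 = \left\|\sum_{k=1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2 \le \left\|\mathcal{A}^{[1]}-\hat{\mathcal{A}}^{[1]}\right\|_2 + \left\|\sum_{k=2}^{\ell}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2 + \left\|\sum_{k=\ell+1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2.$$

Using the inequality $\left(E(x+y)^q\right)^{1/q} \le \left(E x^q\right)^{1/q} + \left(E y^q\right)^{1/q}$, we conclude that

$$\left(E\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2^q\right)^{1/q} \le \left(E\left\|\mathcal{A}^{[1]}-\hat{\mathcal{A}}^{[1]}\right\|_2^q\right)^{1/q} \qquad (23)$$

$$+ \left(E\left\|\sum_{k=2}^{\ell}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2^q\right)^{1/q} \qquad (24)$$

$$+ \left(E\left\|\sum_{k=\ell+1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2^q\right)^{1/q}. \qquad (25)$$

The remainder of the section will focus on the derivation of bounds for terms (23), (24), and (25) of the above equation.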

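4.1 Term (23): Bounding the spectral norm of $\mathcal{A}^{[1]}-\hat{\mathcal{A}}^{[1]}$

The main result of this section is summarized in the following lemma.

Lemma 14. Let $q=\ln n$. Then,

$$\left(E\left\|\mathcal{A}^{[1]}-\hat{\mathcal{A}}^{[1]}\right\|_2^q\right)^{1/q} \le c_2 2^{3(d-1)}d^{1/q}\sqrt{d\ln n}\,\sqrt{\frac{7n+2d\ln n}{s}}\,\|\mathcal{A}\|_F,$$

where $c_2$ is a small numerical constant.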

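Proof. For notational convenience, let $\mathcal{B} = \mathcal{A}^{[1]}-\hat{\mathcal{A}}^{[1]}$ and let $\mathcal{B}_{i_1\ldots i_d}$ denote the entries of $\mathcal{B}$. Recall that $\mathcal{A}^{[1]}$ only contains entries of $\mathcal{A}$ whose squares are greater than or equal to $2^{-1}\frac{\|\mathcal{A}\|_F^2}{s}$. Also, recall that $\hat{\mathcal{A}}^{[1]}$ only contains the (rescaled) entries of $\mathcal{A}^{[1]}$ that were selected after applying the sparsification procedure of Algorithm 1 to $\mathcal{A}$. Using these definitions, $\mathcal{B}_{i_1\ldots i_d}$ is equal to:

$$\mathcal{B}_{i_1\ldots i_d} = \begin{cases} 0, & \text{if } \mathcal{A}_{i_1\ldots i_d}^2 < 2^{-1}\frac{\|\mathcal{A}\|_F^2}{s},\\[4pt] 0, & \text{if } \mathcal{A}_{i_1\ldots i_d}^2 \ge \frac{\|\mathcal{A}\|_F^2}{s} \text{ (since } p_{i_1\ldots i_d} = 1\text{)},\\[4pt] \left(1-\frac{1}{p_{i_1\ldots i_d}}\right)\mathcal{A}_{i_1\ldots i_d}, & \text{with probability } p_{i_1\ldots i_d} = \frac{s\,\mathcal{A}_{i_1\ldots i_d}^2}{\|\mathcal{A}\|_F^2} < 1,\\[4pt] \mathcal{A}_{i_1\ldots i_d}, & \text{with probability } 1-p_{i_1\ldots i_d}. \end{cases}$$

We will estimate the quantity $\left(E\|\mathcal{B}\|_2^q\right)^{1/q}$ via Corollary 2 as follows:

$$\left(E\|\mathcal{B}\|_2^q\right)^{1/q} \le c_1 2^{3(d-1)}\left(\sqrt{d\ln n}+\sqrt{q}\right)\left(\sum_{j=1}^{d} E\max_{i_1,\ldots,i_{j-1},i_{j+1},\ldots,i_d}\left(\sum_{i_j=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2\right)^{q/2}\right)^{1/q}. \qquad (26)$$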
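Towards that end, we will need to bound the $d$ terms

$$E\max_{i_1,\ldots,i_{j-1},i_{j+1},\ldots,i_d}\left(\sum_{i_j=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2\right)^{q/2}, \qquad j=1,\ldots,d. \qquad (27)$$

Since these terms are essentially the same, we will only bound the first one. By $E x \le \sqrt{E x^2}$ for any $x\ge 0$,

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2\right)^{q/2} \le \sqrt{E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2\right)^{q}}.$$

Let $S_{i_2\ldots i_d} = \sum_{i_1=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2$ and let $S = \max_{i_2,\ldots,i_d} S_{i_2\ldots i_d}$. In order to bound $E S^q$, we will first find probabilistic estimates for $S_{i_2\ldots i_d}$ and $S$, and then estimate the quantity $E S^q$ via Lemma 3. Simple algebra and our bounds on the entries $\mathcal{A}_{i_1\ldots i_d}$ that are included in $\mathcal{B}$ give:

$$E\left(\mathcal{B}_{i_1\ldots i_d}^2\right) \le \frac{\|\mathcal{A}\|_F^2}{s}, \qquad \mathrm{Var}\left(\mathcal{B}_{i_1\ldots i_d}^2\right) \le \frac{4\|\mathcal{A}\|_F^4}{s^2}, \qquad\text{and}\qquad \mathcal{B}_{i_1\ldots i_d}^2 \le \frac{\|\mathcal{A}\|_F^2}{s}.$$

We can now apply Bennett's inequality to bound $S_{i_2\ldots i_d}$. Formally, we bound the sum

$$\sum_{i_1=1}^{n}\frac{s}{\|\mathcal{A}\|_F^2}\,\mathcal{B}_{i_1\ldots i_d}^2,$$

since every entry in the above summand is bounded (in absolute value) by one. Clearly, the expectation of the above sum is at most $n$ and its variance is at most $4n$. Thus, from Bennett's inequality, we get

$$P\left(\sum_{i_1=1}^{n}\frac{s}{\|\mathcal{A}\|_F^2}\,\mathcal{B}_{i_1\ldots i_d}^2 > n+t\right) \le e^{-t/2},$$

assuming that $t\ge 6n$. Recall that $S_{i_2\ldots i_d} = \sum_{i_1=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2$, set $t = 6n+2\tau$ for $\tau\ge 0$, and rearrange terms to conclude

$$P\left(S_{i_2\ldots i_d} \ge \frac{7n+2\tau}{s}\|\mathcal{A}\|_F^2\right) \le e^{-3n-\tau} \le e^{-\tau}.$$

An application of the union bound over all $n^{d-1}$ possible values of $S_{i_2\ldots i_d}$ derives

$$P\left(\max_{i_2,\ldots,i_d} S_{i_2\ldots i_d} \ge \frac{7n+2\tau}{s}\|\mathcal{A}\|_F^2\right) \le n^{d-1}e^{-\tau} = e^{-\tau+(d-1)\ln n}.$$

Since $S = \max_{i_2,\ldots,i_d} S_{i_2\ldots i_d}$, we can apply Lemma 3 with $a=7n$, $b=2$, and $h=(d-1)\ln n$ to get

$$E S^q \le 2\left(7n+2(d-1)\ln n+2q\right)^q\frac{\|\mathcal{A}\|_F^{2q}}{s^q}.$$

By letting $q=\ln n$, we obtain

$$E S^q = E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\mathcal{B}_{i_1\ldots i_d}^2\right)^{q} \le 2\left(7n+2d\ln n\right)^q\frac{\|\mathcal{A}\|_F^{2q}}{s^q}.$$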
The same bound can also be derived for the expectation of all $d$ terms of eqn. (27). We now substitute these bounds in eqn. (26):

$$\left(E\|\mathcal{B}\|_2^q\right)^{1/q} \le c_1 2^{3(d-1)}\left(\sqrt{d\ln n}+\sqrt{q}\right)\left(\sqrt{2}\,d\,\left(7n+2d\ln n\right)^{q/2}\frac{\|\mathcal{A}\|_F^{q}}{s^{q/2}}\right)^{1/q}$$

$$\le c_2 2^{3(d-1)}d^{1/q}\sqrt{d\ln n}\,\sqrt{\frac{7n+2d\ln n}{s}}\,\|\mathcal{A}\|_F,$$

which completes the proof of the lemma. □



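4.2 Term (24): Bounding the spectral norm of $\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}$ for small $k$

We now focus on estimating the spectral norm of the tensors $\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}$ for $2\le k\le\left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$. The following lemma summarizes the main result of this section.

Lemma 15. Assume $2\le d\le 0.5\ln n$. Let $q=\ln n$; for all $2\le k\le\left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$,

$$\left(E\left\|\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right\|_2^q\right)^{1/q} \le c_3 2^{3(d-1)}d^{1/q}\sqrt{d\ln n}\,\sqrt{\frac{6n^{d/2}}{s}}\,\|\mathcal{A}\|_F,$$

where $c_3$ is a small numerical constant.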

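Proof. For notational convenience, we let $\hat{\mathcal{A}}_{i_1\ldots i_d}$ denote the entries of the tensor $\hat{\mathcal{A}}^{[k]}$. Then

$$\hat{\mathcal{A}}_{i_1\ldots i_d} = \frac{\delta_{i_1\ldots i_d}\,\mathcal{A}_{i_1\ldots i_d}}{p_{i_1\ldots i_d}} \qquad (28)$$

for those entries $\mathcal{A}_{i_1\ldots i_d}$ of $\mathcal{A}$ satisfying $\mathcal{A}_{i_1\ldots i_d}^2\in\left[\frac{\|\mathcal{A}\|_F^2}{2^{k}s},\ \frac{\|\mathcal{A}\|_F^2}{2^{k-1}s}\right)$. All the entries of $\hat{\mathcal{A}}^{[k]}$ that correspond to entries of $\mathcal{A}$ outside this interval are set to zero. The indicator function $\delta_{i_1\ldots i_d}$ is defined as

$$\delta_{i_1\ldots i_d} = \begin{cases} 1, & \text{with probability } p_{i_1\ldots i_d} = \frac{s\,\mathcal{A}_{i_1\ldots i_d}^2}{\|\mathcal{A}\|_F^2} \le 1,\\[4pt] 0, & \text{with probability } 1-p_{i_1\ldots i_d}. \end{cases}$$

Notice that $p_{i_1\ldots i_d}$ is always in the interval $\left[2^{-k}, 2^{-(k-1)}\right)$ from the constraint on the size of $\mathcal{A}_{i_1\ldots i_d}^2$. It is now easy to see that $E\hat{\mathcal{A}}^{[k]} = \mathcal{A}^{[k]}$. Thus, by applying Theorem 2,

$$\left(E\left\|\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right\|_2^q\right)^{1/q} \le c_1 2^{3(d-1)}\left(\sqrt{d\ln n}+\sqrt{q}\right)\left(\sum_{j=1}^{d} E\max_{i_1,\ldots,i_{j-1},i_{j+1},\ldots,i_d}\left(\sum_{i_j=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q/2}\right)^{1/q}. \qquad (29)$$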



We now follow the same strategy as in Section 4.1 in order to estimate the expectation terms in the right-hand side of the above inequality (i.e., we focus on the first term ($j=1$) only). First, note that

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q/2} \le \sqrt{E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q}}.$$
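Let $S_{i_2\ldots i_d} = \sum_{i_1=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2$. Then, using eqn. (28), the definition of $p_{i_1\ldots i_d}$, and $\delta_{i_1\ldots i_d}^2 = \delta_{i_1\ldots i_d}$, we get

$$S_{i_2\ldots i_d} = \sum_{i_1=1}^{n}\frac{\delta_{i_1\ldots i_d}\,\mathcal{A}_{i_1\ldots i_d}^2}{p_{i_1\ldots i_d}^2} = \frac{\|\mathcal{A}\|_F^2}{s}\sum_{i_1=1}^{n}\frac{\delta_{i_1\ldots i_d}}{p_{i_1\ldots i_d}}.$$

Using $p_{i_1\ldots i_d}\ge 2^{-k}$, we get $S_{i_2\ldots i_d}\le\frac{2^{k}\|\mathcal{A}\|_F^2}{s}\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}$, which leads to

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q} = E\max_{i_2,\ldots,i_d} S_{i_2\ldots i_d}^{q} \le \left(\frac{2^{k}\|\mathcal{A}\|_F^2}{s}\right)^{q} E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}\right)^{q}. \qquad (30)$$

We now seek a bound for the expectation $E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}\right)^{q}$. The following lemma, whose proof may be found in the Appendix, provides such a bound.

Lemma 16. Assume $2\le d\le 0.5\ln n$. For any $q\le\ln n$,

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}\right)^{q} \le 2\left(6n^{d/2}2^{-k}\right)^{q}.$$

Combining Lemma 16 and equation (30), we obtain

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\hat{\mathcal{A}}_{i_1\ldots i_d}^2\right)^{q} \le 2\left(\frac{6n^{d/2}\|\mathcal{A}\|_F^2}{s}\right)^{q}.$$

The same bound can be derived for all other terms in eqn. (29). Thus, substituting in eqn. (29) we get the claim of the lemma. □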



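4.3 Term (25): bounding the tail

We now focus on values of $k$ that exceed $\ell = \left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$ and prove the following lemma, which immediately provides a bound for term (25).

Lemma 17. Using our notation,

$$\left\|\sum_{k=\ell+1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2 \le \sqrt{\frac{n^{d/2}\ln^2 n}{s}}\,\|\mathcal{A}\|_F.$$

Proof. Intuitively, by the definition of $\mathcal{A}^{[k]}$, we can observe that when $k$ is larger than $\ell = \left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$, the entries of $\mathcal{A}^{[k]}$ are very small, whereas the entries of $\hat{\mathcal{A}}^{[k]}$ are all set to zero during the second step of our sparsification algorithm. Formally, consider the sum

$$\mathcal{D} = \sum_{k=\ell+1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right).$$

For all $k\ge\ell+1 > \log_2\left(n^{d/2}/\ln^2 n\right)$, notice that the squares of all the entries of $\mathcal{A}^{[k]}$ are at most $\frac{\ln^2 n}{n^{d/2}}\frac{\|\mathcal{A}\|_F^2}{s}$ (by definition) and thus the tensors $\hat{\mathcal{A}}^{[k]}$ are all-zero tensors. The above sum now reduces to

$$\mathcal{D} = \sum_{k=\ell+1}^{\infty}\mathcal{A}^{[k]},$$

where the squares of all the entries of $\mathcal{D}$ are at most $\frac{\ln^2 n}{n^{d/2}}\frac{\|\mathcal{A}\|_F^2}{s}$. Since $\mathcal{D}\in\mathbb{R}^{n\times\cdots\times n}$, using $\|\mathcal{D}\|_2\le\|\mathcal{D}\|_F$ we immediately get

$$\left\|\sum_{k=\ell+1}^{\infty}\left(\mathcal{A}^{[k]}-\hat{\mathcal{A}}^{[k]}\right)\right\|_2 = \|\mathcal{D}\|_2 \le \sqrt{\sum_{i_1,\ldots,i_d}\mathcal{D}_{i_1\ldots i_d}^2} \le \sqrt{\frac{n^{d/2}\ln^2 n}{s}}\,\|\mathcal{A}\|_F.$$

□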



4.4 Completing the proof of Theorem 1

Theorem 1 emerges by substituting Lemmas 14, 15, and 17 to bound terms (23), (24), and (25). First of all, we will apply the three lemmas with $q=\ln n$; this immediately implies that the quantities $d^{1/q}$ that appear in the first two lemmas are bounded by a constant. Also, we will use the assumption that $d\le 0.5\ln n$; the fact that $n\le n^{d/2}$ for all $d\ge 2$; and the fact that $2d\ln n\le 2n$, since $n\ge 300$ and $d\le 0.5\ln n$. Then, by manipulating/removing constants we get:

$$\left(E\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2^{\ln n}\right)^{\frac{1}{\ln n}} \le c_2'\,8^d\sqrt{d}\,\sqrt{\frac{n^{d/2}\ln n}{s}}\,\|\mathcal{A}\|_F + \left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor c_3'\,8^d\sqrt{d}\,\sqrt{\frac{n^{d/2}\ln n}{s}}\,\|\mathcal{A}\|_F + \sqrt{\frac{n^{d/2}\ln^2 n}{s}}\,\|\mathcal{A}\|_F.$$

In order to further simplify the right-hand side of the above inequality, we observe that for $n\ge 300$, $\sqrt{\ln n}\le\log_2\left(n^{d/2}/\ln^2 n\right)$. Thus, the second term in the right-hand side of the above inequality dominates, and thus there exists some constant $c_4$ such that

$$\left(E\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2^{\ln n}\right)^{\frac{1}{\ln n}} \le c_4\log_2\left(n^{d/2}/\ln^2 n\right)8^d\sqrt{d}\,\sqrt{\frac{n^{d/2}\ln n}{s}}\,\|\mathcal{A}\|_F \le c_5\,d^{3/2}8^d\sqrt{\frac{n^{d/2}\ln^3 n}{s}}\,\|\mathcal{A}\|_F.$$

In the last inequality, we dropped the $\ln^2 n$ factor from the denominator of the $\log_2$ expression (this results in a slight loss of tightness, but simplifies our final result) and manipulated the simplified $\log_2$ expression. Applying Markov's inequality, we conclude that

$$\left\|\mathcal{A}-\hat{\mathcal{A}}\right\|_2 \le c_6\,d^{3/2}8^d\sqrt{\frac{n^{d/2}\ln^3 n}{s}}\,\|\mathcal{A}\|_F$$

holds with probability at least $1-n^{-1}$. Theorem 1 now follows by setting $s$ to the appropriate value.
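The Markov step can be spelled out as follows. This is a sketch under the assumption that the moment bound is applied with $q=\ln n$; the symbol $M$ (our notation) stands for any upper bound on $\left(E\|\mathcal{A}-\hat{\mathcal{A}}\|_2^{q}\right)^{1/q}$, and the multiplicative constant $e$ is one convenient choice rather than a value taken from the text.

```latex
\begin{align*}
\Pr\left(\|\mathcal{A}-\hat{\mathcal{A}}\|_2 \ge e\,M\right)
  &= \Pr\left(\|\mathcal{A}-\hat{\mathcal{A}}\|_2^{q} \ge e^{q} M^{q}\right) \\
  &\le \frac{\mathbb{E}\,\|\mathcal{A}-\hat{\mathcal{A}}\|_2^{q}}{e^{q} M^{q}}
   \le e^{-q} = e^{-\ln n} = \frac{1}{n}.
\end{align*}
```

Raising the norm to a high power before applying Markov's inequality is what turns a moment bound into a failure probability that decays polynomially in $n$.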
5 Open problems

An interesting open problem would be to investigate whether there exist algorithms that, either
deterministically or probabilistically, select elements of $\mathcal{A}$ to include in $\hat{\mathcal{A}}$ and achieve much
better accuracy than existing schemes. For example, notice that our algorithm, as well as prior 
ones, sample entries of A with respect to their magnitudes; better sampling schemes might 
be possible. Improved accuracy will probably come at the expense of increased running time. 
Such algorithms would be very interesting from a mathematical and algorithmic viewpoint, 
since they will allow a better quantification of properties of a matrix/tensor in terms of its 
entries. 
Appendix

Proof of Lemma 3.

Proof. (a) From our assumption,

$$P\left(X\ge a+b(t+h)\right)\le e^{-t}.$$

Let $s = a+b(t+h)$. For any $q\ge 1$,

$$E X^q = \int_0^\infty P(X\ge s)\,ds^q = q\int_0^\infty P(X\ge s)\,s^{q-1}\,ds \le q\int_0^{a+bh}s^{q-1}\,ds + q\int_{a+bh}^{\infty}s^{q-1}e^{-\frac{s-a-bh}{b}}\,ds.$$

The first term in the above sum is equal to $(a+bh)^q$. The second term is somewhat harder to compute. We start by letting $g = a+bh$ and changing variables, thus getting

$$\int_{a+bh}^{\infty}s^{q-1}e^{-\frac{s-g}{b}}\,ds = b\int_0^\infty (g+bt)^{q-1}e^{-t}\,dt = b\sum_{i=0}^{q-1}\binom{q-1}{i}g^{i}b^{q-1-i}\int_0^\infty t^{q-1-i}e^{-t}\,dt.$$

We can now integrate by parts and get

$$\int_0^\infty t^{q-1-i}e^{-t}\,dt = (q-1-i)! \le q^{q-1-i} \qquad\text{for all } i=0,\ldots,q-1.$$

Combining the above,

$$q\int_{a+bh}^{\infty}s^{q-1}e^{-\frac{s-g}{b}}\,ds \le qb\sum_{i=0}^{q-1}\binom{q-1}{i}(bq)^{q-1-i}g^{i} = qb\,(bq+g)^{q-1}.$$

Finally,

$$E X^q \le (a+bh)^q + bq\,(bq+g)^{q-1} \le 2\,(a+bh+bq)^q,$$

which concludes the proof of the first part.

(b) From our assumption and since $t$ and $h$ are non-negative, we get

$$P\left(M\ge a+b(t+\sqrt{h})\right)\le e^{-(t+\sqrt{h})^2+h}\le e^{-t^2}.$$

Let $s = a+b\sqrt{h}+tb$. For any $q\ge 1$,

$$E M^q = \int_0^\infty P(M\ge s)\,ds^q = q\int_0^\infty P(M\ge s)\,s^{q-1}\,ds \le q\int_0^{a+b\sqrt{h}}s^{q-1}\,ds + q\int_{a+b\sqrt{h}}^{\infty}s^{q-1}e^{-\frac{(s-a-b\sqrt{h})^2}{b^2}}\,ds.$$

The first term in the above sum is equal to $\left(a+b\sqrt{h}\right)^q$. We now evaluate the second integral. Let $g = a+b\sqrt{h}$ and perform a change of variables to get

$$\int_{a+b\sqrt{h}}^{\infty}s^{q-1}e^{-\frac{(s-g)^2}{b^2}}\,ds = b\int_0^\infty (g+bt)^{q-1}e^{-t^2}\,dt = b\sum_{i=0}^{q-1}\binom{q-1}{i}g^{i}b^{q-1-i}\int_0^\infty t^{q-1-i}e^{-t^2}\,dt.$$

By integrating by parts we get (see below for a proof of eqn. (31)):
Thus, using $g = a+b\sqrt{h}$,

Finally, we conclude that

$$E M^q \le \left(a+b\sqrt{h}\right)^q + \sqrt{2}\,bq\left(a+b\sqrt{h}+b\sqrt{\frac{q}{2}}\right)^{q-1}$$

$$\le \left(a+b\sqrt{h}+b\sqrt{\frac{q}{2}}\right)^{q} + \sqrt{4q}\left(a+b\sqrt{h}+b\sqrt{\frac{q}{2}}\right)^{q}$$

$$\le 3\sqrt{q}\left(a+b\sqrt{h}+b\sqrt{\frac{q}{2}}\right)^{q},$$

which is the claim of the lemma. In the above we used the positivity of $a$, $b$, and $h$ as well as the fact that $1+\sqrt{4q}\le 3\sqrt{q}$ for all $q\ge 1$. □
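Both parts of Lemma 3 can be checked numerically on random variables whose tails meet the hypotheses with equality. The sketch below is an illustration under assumptions of our own: for part (a) it takes $X = a+bh+bT$ with $T$ a standard exponential, and for part (b) it takes $M = a+b\sqrt{h}+bT$ with $P(T>t)=e^{-t^2}$; integer $q$ is assumed so that the moments follow from a binomial expansion. All function names are hypothetical.

```python
import math


def moment_exponential_tail(a, b, h, q):
    """E X^q for X = a + b*h + b*T, T ~ Exp(1), so that P(X > a + b(t + h)) = e^{-t}."""
    g = a + b * h
    # Binomial expansion with E T^i = i!.
    return sum(math.comb(q, i) * g ** (q - i) * b**i * math.factorial(i) for i in range(q + 1))


def moment_gaussian_tail(a, b, h, q):
    """E M^q for M = a + b*sqrt(h) + b*T with P(T > t) = e^{-t^2} for t >= 0."""
    g = a + b * math.sqrt(h)
    # For this tail, E T^i = Gamma(i/2 + 1).
    return sum(math.comb(q, i) * g ** (q - i) * b**i * math.gamma(i / 2 + 1) for i in range(q + 1))


def bound_a(a, b, h, q):
    """Right-hand side of part (a): 2 (a + b h + b q)^q."""
    return 2 * (a + b * h + b * q) ** q


def bound_b(a, b, h, q):
    """Right-hand side of part (b): 3 sqrt(q) (a + b sqrt(h) + b sqrt(q/2))^q."""
    return 3 * math.sqrt(q) * (a + b * math.sqrt(h) + b * math.sqrt(q / 2)) ** q
```

For instance, comparing `moment_exponential_tail(1.0, 2.0, 3.0, 4)` with `bound_a(1.0, 2.0, 3.0, 4)` exercises part (a), and the analogous pair of calls exercises part (b).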

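Proof of eqn. (31).

Proof. We now compute the integral $\int_0^\infty t^q e^{-t^2}\,dt$. Integrating by parts, we get

$$\int_0^\infty t^q e^{-t^2}\,dt = -\frac{1}{2}\,t^{q-1}e^{-t^2}\Big|_0^\infty + \frac{q-1}{2}\int_0^\infty t^{q-2}e^{-t^2}\,dt = \frac{q-1}{2}\int_0^\infty t^{q-2}e^{-t^2}\,dt.$$

When $q$ is even, we get

$$\int_0^\infty t^q e^{-t^2}\,dt = \left(\frac{1}{2}\right)^{q/2}(q-1)!!\int_0^\infty e^{-t^2}\,dt = \frac{\sqrt{\pi}}{2}\left(\frac{1}{2}\right)^{q/2}(q-1)!!,$$

where $q!! = q(q-2)(q-4)\cdots$. If $q$ is odd, then

$$\int_0^\infty t^q e^{-t^2}\,dt = \left(\frac{1}{2}\right)^{\lfloor q/2\rfloor}(q-1)!!\int_0^\infty t\,e^{-t^2}\,dt = \left(\frac{1}{2}\right)^{\lfloor q/2\rfloor+1}(q-1)!!.$$

We thus conclude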



^^mr^<-M^r<-M^'' 2 
□
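The even and odd closed forms for $\int_0^\infty t^q e^{-t^2}\,dt$ obtained by the integration by parts can be cross-checked against the standard identity $\int_0^\infty t^q e^{-t^2}\,dt = \tfrac{1}{2}\Gamma\!\left(\tfrac{q+1}{2}\right)$. The sketch below does this with the standard library only; the function names are our own, and the convention $0!! = (-1)!! = 1$ is assumed.

```python
import math


def double_factorial(m: int) -> int:
    """m!! = m (m-2) (m-4) ..., with the convention 0!! = (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def gaussian_moment_closed_form(q: int) -> float:
    """Integral of t^q e^{-t^2} over [0, inf) via the double-factorial formulas."""
    if q % 2 == 0:
        return 0.5 ** (q // 2) * double_factorial(q - 1) * math.sqrt(math.pi) / 2
    return 0.5 ** (q // 2 + 1) * double_factorial(q - 1)


def gaussian_moment_gamma(q: int) -> float:
    """The same integral written as Gamma((q + 1) / 2) / 2."""
    return math.gamma((q + 1) / 2) / 2
```

The two functions agree up to floating-point rounding for every non-negative integer $q$, which confirms the recursion $I_q = \frac{q-1}{2}I_{q-2}$ and both base cases.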



Proof of Lemma 31 

Proof. We start by noting that every vector $z\in S^{n-1}$ can be written as $z = x+h$, where $x$ lies in $N$ and $\|h\|_2\le\epsilon$. Using the triangle inequality for the tensor spectral norm, we get

$$\|\mathcal{A}\|_2 = \sup_{z\in S^{n-1}}\|\mathcal{A}\times_1 z\|_2 \le \sup_{x\in N}\|\mathcal{A}\times_1 x\|_2 + \sup_{\|h\|_2\le\epsilon}\|\mathcal{A}\times_1 h\|_2.$$

It is now easy to bound the second term in the right-hand side of the above equation by $\epsilon\|\mathcal{A}\|_2$. Thus,

$$\|\mathcal{A}\|_2 \le \frac{1}{1-\epsilon}\sup_{x\in N}\|\mathcal{A}\times_1 x\|_2.$$

Repeating the same argument recursively for the tensor $\mathcal{A}\times_1 x$ etc. we obtain the lemma. □
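The one-step inequality $\|\mathcal{A}\|_2\le\frac{1}{1-\epsilon}\sup_{x\in N}\|\mathcal{A}\times_1 x\|_2$ is easiest to see in the order-2 (matrix) case in two dimensions, where a net on the unit circle can be written down explicitly. The sketch below is only this special case, with an equally spaced net of our own choosing and hypothetical function names; it is not the net construction used in the paper.

```python
import math


def spectral_norm_2x2(A):
    """Largest singular value of a 2x2 matrix [[a, b], [c, d]] in closed form."""
    (a, b), (c, d) = A
    s1 = a * a + b * b + c * c + d * d  # trace of A^T A
    det = a * d - b * c                 # det(A^T A) = det^2
    return math.sqrt((s1 + math.sqrt(max(s1 * s1 - 4 * det * det, 0.0))) / 2)


def net_estimate(A, num_points):
    """Return (eps, sup over the net of ||A x||_2) for an equally spaced net on the unit circle."""
    (a, b), (c, d) = A
    # Net points are 2*pi/num_points apart in angle, so every unit vector lies within
    # chord distance 2*sin(pi/(2*num_points)) of some net point.
    eps = 2 * math.sin(math.pi / (2 * num_points))
    best = 0.0
    for j in range(num_points):
        angle = 2 * math.pi * j / num_points
        x, y = math.cos(angle), math.sin(angle)
        best = max(best, math.hypot(a * x + b * y, c * x + d * y))
    return eps, best
```

With at least four net points we have $\epsilon<1$, and `spectral_norm_2x2(A) <= best / (1 - eps)` then reproduces the bound of the proof, while `best <= spectral_norm_2x2(A)` holds trivially.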

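Proof of Lemma 16

Proof. Let $S = \max_{i_2,\ldots,i_d}\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}$. We will first estimate the probability $P(S\ge t)$ and then apply Lemma 3 in order to bound the expectation $E S^q$. Recall from the definition of $\delta_{i_1\ldots i_d}$ that $E\left(\delta_{i_1\ldots i_d}-p_{i_1\ldots i_d}\right) = 0$ and let

$$X = \sum_{i_1=1}^{n}\left(\delta_{i_1\ldots i_d}-p_{i_1\ldots i_d}\right).$$

We will apply Bennett's inequality in order to bound $X$. Clearly $\left|\delta_{i_1\ldots i_d}-p_{i_1\ldots i_d}\right|\le 1$ and

$$\mathrm{Var}(X) = \sum_{i_1=1}^{n}\mathrm{Var}\left(\delta_{i_1\ldots i_d}-p_{i_1\ldots i_d}\right) = \sum_{i_1=1}^{n}E\left(\delta_{i_1\ldots i_d}-p_{i_1\ldots i_d}\right)^2 = \sum_{i_1=1}^{n}\left(p_{i_1\ldots i_d}-p_{i_1\ldots i_d}^2\right) \le \sum_{i_1=1}^{n}p_{i_1\ldots i_d}.$$

Recalling the definition of $p_{i_1\ldots i_d}$ and the bounds on the $\mathcal{A}_{i_1\ldots i_d}^2$'s, we get

$$\mathrm{Var}(X) \le \sum_{i_1=1}^{n}\frac{s\,\mathcal{A}_{i_1\ldots i_d}^2}{\|\mathcal{A}\|_F^2} \le n2^{-(k-1)}.$$

We can now apply Bennett's inequality in order to get

$$P(X>t) = P\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d} > \sum_{i_1=1}^{n}p_{i_1\ldots i_d}+t\right) \le e^{-t/2},$$

for any $t\ge 3n2^{-(k-1)}/2$. Thus, with probability at least $1-e^{-t/2}$,

$$\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d} \le n2^{-(k-1)}+t,$$

since $\sum_{i_1=1}^{n}p_{i_1\ldots i_d}\le n2^{-(k-1)}$. Setting $t = \left(3n2^{-(k-1)}/2\right)+2\tau$ for any $\tau\ge 0$ we get

$$P\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d} \ge 5n2^{-k}+2\tau\right) \le e^{-\tau}.$$

Taking a union bound yields

$$P\left(\max_{i_2,\ldots,i_d}\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d} \ge 5n2^{-k}+2\tau\right) \le n^{d-1}e^{-\tau} = e^{-\tau+(d-1)\ln n},$$

where the $n^{d-1}$ term appears because of all possible choices for the indices $i_2,\ldots,i_d$. Applying Lemma 3 with $a = 5n2^{-k}$, $b=2$, and $h=(d-1)\ln n$, we get

$$E\max_{i_2,\ldots,i_d}\left(\sum_{i_1=1}^{n}\delta_{i_1\ldots i_d}\right)^{q} \le 2\left(5n2^{-k}+2(d-1)\ln n+2q\right)^{q} \le 2\left(5n2^{-k}+2d\ln n\right)^{q}. \qquad (32)$$

The last inequality follows since $q\le\ln n$. We now note that, clearly, $5n2^{-k}\le 5n^{d/2}2^{-k}$ for all $d\ge 2$. Also, using our assumption $d\le 0.5\ln n$ as well as $k\le\left\lfloor\log_2\left(n^{d/2}/\ln^2 n\right)\right\rfloor$, we can prove that

$$2d\ln n \le n^{d/2}2^{-k}.$$

Substituting the above bounds in eqn. (32) concludes the proof of the lemma.

□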
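For completeness, the inequality $2d\ln n\le n^{d/2}2^{-k}$ used at the end of the proof follows in two lines from the stated assumptions. The derivation below is our own short reconstruction of that step, not text from the original proof.

```latex
\begin{align*}
k \le \log_2\!\left(\frac{n^{d/2}}{\ln^2 n}\right)
  &\;\Longrightarrow\; 2^{-k} \ge \frac{\ln^2 n}{n^{d/2}}
   \;\Longrightarrow\; n^{d/2}\,2^{-k} \ge \ln^2 n, \\
d \le 0.5\ln n
  &\;\Longrightarrow\; 2d\ln n \le \ln^2 n \le n^{d/2}\,2^{-k}.
\end{align*}
```

Together with $5n2^{-k}\le 5n^{d/2}2^{-k}$ this gives $5n2^{-k}+2d\ln n\le 6n^{d/2}2^{-k}$, which is the constant appearing in the statement of Lemma 16.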